mideind / greynirengine
A fast, efficient natural language processing engine for Icelandic.
Home Page: https://greynir.is
License: Other
Greynir makes it easy to lemmatize text. If the parser fails, I can fall back to bintokenizer and get multiple lemmas covering all meanings of each word. This makes for a great search index, even if there are some extra lemmas in it for the sentences where the parser fails.
Perhaps Greynir should provide a function out of the box to do this, as it will be a common use case? I can share my code if anyone wants to see it.
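For illustration, here is a minimal sketch of the fallback described above. It is not Greynir's own API: the function name lemmas_with_fallback is hypothetical, and the code only assumes that a sentence object exposes tree (None when the parse failed), lemmas, and tokens, and that each token has txt and a val that is a list of meanings carrying a stofn (lemma) attribute. Check those attribute names against the version you have installed.

```python
from typing import Any, List, Set


def lemmas_with_fallback(sentence: Any) -> List[Set[str]]:
    """Return one set of candidate lemmas per token.

    Assumed (not verified here) shape of `sentence`:
      sentence.tree   -> None if the parse failed
      sentence.lemmas -> list of lemma strings for a parsed sentence
      sentence.tokens -> tokens with .txt and .val (list of meanings with .stofn)
    """
    if getattr(sentence, "tree", None) is not None:
        # Parse succeeded: exactly one disambiguated lemma per token
        return [{lemma} for lemma in sentence.lemmas]

    result: List[Set[str]] = []
    for tok in sentence.tokens:
        if not tok.txt:
            continue  # skip tokens that carry no text
        meanings = tok.val if isinstance(tok.val, list) else []
        lemmas = {m.stofn for m in meanings if hasattr(m, "stofn")}
        # Unknown word: keep the surface form so it is still indexed
        result.append(lemmas or {tok.txt})
    return result
```

With this shape, an ambiguous token such as "á" in an unparsed sentence contributes every lemma it can have, which is the "extra lemmas" trade-off mentioned above.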
I ran the 100 most common first names in Iceland through greynir.parse. No female names are interpreted as verbs, but a few male ones are. See this gist for the code.
https://gist.github.com/jokull/2c1048bbc845feb46c717ac7c77e0cc5
If there is a way to augment the grammar file for specific project contexts, that should be documented.
I would love it if it were easier to reach other case and number variants when you have a token meaning or terminal instance. Something like token.get_singular and token.get_accusative.
Using the example code from https://greynir.is/doc/quickstart.html and the text from ruv.is:
my_text = "Ákæran var þingfest í Héraðsdómi Reykjaness í dag en fréttastofu er ekki kunnugt um hvort maðurinn játaði eða neitaði sök þar sem þinghaldið í málinu er lokað."
Gives the exception in the title.
Using the example code from https://greynir.is/doc/quickstart.html and the text from ruv.is:
my_text = "Viðar Garðarsson , sem setti upp vefsíður fyrir Sigmund Davíð Gunnlaugsson í kjölfar birtingu Panamaskjalanna , segist ekki vita hvers vegna ákveðið var að segja að vefjunum væri haldið úti af stuðningsmönnum Sigmundar"
Gives the exception in the title.
Reynir doesn't identify alphabetic characters appended to house numbers as part of the address.
>>> s = r.parse_single("Hann býr á Bárugötu 14.")
>>> print(s.tree.view)
P
+-S-MAIN
+-IP
+-NP-SUBJ
+-pfn_kk_et_nf: 'Hann'
+-VP-SEQ
+-VP
+-so_0_et_p3: 'býr'
+-PP
+-fs_þgf: 'á'
+-NP-ADDR
+-gata_þgf_kvk: 'Bárugötu'
+-tala: '14'
+-'.'
>>> s = r.parse_single("Hann býr á Bárugötu 14b.")
>>> print(s.tree.view)
P
+-S-MAIN
+-IP
+-NP-SUBJ
+-pfn_kk_et_nf: 'Hann'
+-VP-SEQ
+-VP
+-so_0_et_p3: 'býr'
+-PP
+-fs_þgf: 'á'
+-NP
+-sérnafn: 'Bárugötu'
+-tala: '14'
+-no_et_þgf_hk: 'b'
+-'.'
>>> s = r.parse_single("Hann býr á Bárugötu 14A.")
>>> print(s.tree.view)
P
+-S-MAIN
+-IP
+-NP-SUBJ
+-pfn_kk_et_nf: 'Hann'
+-VP-SEQ
+-VP
+-so_0_et_p3: 'býr'
+-PP
+-fs_þgf: 'á'
+-NP
+-no_et_þgf_kvk: 'Bárugötu'
+-NP-POSS
+-NP-MEASURE
+-mælieining: '14 A'
+-'.'
Thank you for the great project!
While using the python package, I ran into a bug with a wrong word being returned. The nominative case of Þór in "Ragnar Þór Valgeirsson" is wrong when using a NounPhrase.
greynir_bug.py
from reynir import NounPhrase as Nl
nafn = Nl("Ragnar Þór Valgeirsson")
print(f"{nafn:nf}")
Results after executing ten times
for i in {1..10}; do python greynir_bug.py; done
Ragnar Þórr Valgeirsson
Ragnar Þórr Valgeirsson
Ragnar Þór Valgeirsson
Ragnar Þórr Valgeirsson
Ragnar Þórr Valgeirsson
Ragnar Þór Valgeirsson
Ragnar Þórr Valgeirsson
Ragnar Þór Valgeirsson
Ragnar Þór Valgeirsson
Ragnar Þórr Valgeirsson
I’m not familiar with the parsing pipeline but I thought I would share an instance of where the parser tripped in a (to me) surprising way:
greynir.parse_single('Sótt er um leyfi til að byggja 50 leiguíbúðir fyrir námsmenn á lóð við Austurhlíð.').lemmas
This is fine and gives me the right lemmas for leiguíbúð, námsmaður etc.
greynir.parse_single('Sótt er um leyfi til að byggja 50 leiguíbúðir fyrir námsmenn, á lóð við Austurhlíð.').lemmas
The comma before "á lóð" gives me the "eiga" lemma for "á" instead of just "á".
Sorry if GitHub issues is the wrong place. I’m mainly curious about the roadmap, design and limitations. I assume Greynir uses commas to fragment sentences to keep down the parse pathways.
BTW this is a real world example.
Loving Greynir and following your progress! ✨
EDIT: Screenshot might help
I’m writing some code to detect mentions of companies. The corpus uses the ehf/of/sf/etc. suffixes so that’s a strong indicator for me, and potentially for Greynir too.
I know that the Greynir website has an entity recognizer, but it seems quite strongly coupled to the database. Is there a case for bintokenizer to adapt a new token type? Or perhaps for Greynir to become company-entity aware?
I have some interesting examples of company names if that’s useful. I’m currently using an imperfect regex to match company names and then using Greynir to go back to the indefinite form.
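As a rough sketch of the regex approach mentioned above (not the author's actual pattern), the following matches a run of capitalized words followed by a company suffix. Only ehf and sf are taken from the text; the names SUFFIXES, COMPANY_RE and find_companies, and the character class for Icelandic capitals, are assumptions made for this example.

```python
import re
from typing import List, Tuple

# Suffixes mentioned in the text above; extend this tuple as needed.
SUFFIXES = ("ehf", "sf")

# Uppercase letters that may start a name word (assumed set for this sketch)
UPPER = "A-ZÁÐÉÍÓÚÝÞÆÖ"

_suffix_alt = "|".join(re.escape(s) for s in SUFFIXES)

# One or more capitalized words, then a suffix, then an optional period
COMPANY_RE = re.compile(rf"\b((?:[{UPPER}]\w*\s+)+)({_suffix_alt})\b\.?")


def find_companies(text: str) -> List[Tuple[str, str]]:
    """Return (name, suffix) pairs for each company-like match in text."""
    return [(m.group(1).strip(), m.group(2)) for m in COMPANY_RE.finditer(text)]


print(find_companies("Samningur við Bláa Lónið ehf. var undirritaður."))
```

This is deliberately imperfect: it misses names containing lowercase words, hyphens or ampersands, and it returns the name in whatever case it appears in the text, so the matched name still has to go through Greynir to recover the nominative.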
These are the suffixes I’ve come across:
from reynir import NounPhrase
np_1 = NounPhrase('ýmsir menn, þar á meðal þessi')
print(f'Ég er í slagtogi með {np_1:þgf}.')
# Output: Ég er í slagtogi með ýmsum mönnum, þar á meðal þessum.
# Expected: Same as output
np_2 = NounPhrase('ýmsir menn, til dæmis þessi')
print(f'Ég er í slagtogi með {np_2:þgf}.')
# Output: Ég er í slagtogi með ýmsir menn, til dæmis þessi.
# Expected: Ég er í slagtogi með ýmsum mönnum, til dæmis þessum.
np_3 = NounPhrase('ýmsir menn, t.d. þessi')
print(f'Ég er í slagtogi með {np_3:þgf}.')
# Output: Ég er í slagtogi með ýmsir menn, t.d. þessi.
# Expected: Ég er í slagtogi með ýmsum mönnum, t.d. þessum.
np_4 = NounPhrase('ýmsir menn, þ.á m. þessi')
print(f'Ég er í slagtogi með {np_4:þgf}.')
# Output: Ég er í slagtogi með ýmsir menn, þ.á m. þessi.
# Expected: Ég er í slagtogi með ýmsum mönnum, þ.á m. þessum.
Parsing leads to extreme memory requirements when it encounters long sentences. Perhaps it should be documented that 8 GB is the minimum memory and that 100 tokens is the maximum sentence fragment length?
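Until such a limit is documented, one possible workaround is to filter out overly long sentences before handing them to the parser. The sketch below is only an illustration: split_for_parsing is a hypothetical helper, the limit of 100 is the figure suggested above, and counting whitespace-separated words is a crude stand-in for the real token count.

```python
from typing import Iterable, List, Tuple


def split_for_parsing(
    sentences: Iterable[str], max_tokens: int = 100
) -> Tuple[List[str], List[str]]:
    """Partition sentences into (short enough to parse, too long to parse).

    Whitespace splitting only approximates the tokenizer's count, so
    leave some headroom when choosing max_tokens.
    """
    to_parse: List[str] = []
    too_long: List[str] = []
    for sentence in sentences:
        if len(sentence.split()) <= max_tokens:
            to_parse.append(sentence)
        else:
            too_long.append(sentence)
    return to_parse, too_long
```

The sentences in the second list can then be tokenized without parsing, so they still contribute lemmas to an index without triggering the memory spike.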
Using version 3.5.5, consider the following ipython log.
In [15]: from reynir import NounPhrase
In [16]: name = NounPhrase("skipulags- og byggingarlög")
In [17]: name
Out[17]: <reynir.NounPhrase('skipulags- og byggingarlög'), parsed>
In [18]: name.dative
Out[18]: 'skipulags- og byggingarlög'
In [19]: name.genitive
Out[19]: 'skipulags- og byggingarlög'
The dative is expected to be "skipulags- og byggingarlögum" and the genitive is expected to be "skipulags- og byggingarlaga".
Is there a way to get verb form variants in the same way you can get case variants for nouns?
Something like
>>> BIN_Db.lookup_past_participle("sækja")
The use case is for a results highlighter. Lemmas are indexed, but I would like to highlight the original forms based on search string lemmas. For this I need to potentially highlight derived word forms. I’m basically writing a get_all_meaning_wordforms function that returns a set of strings that should be highlighted.
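To make the use case concrete, here is a small sketch of the highlighting step only. It assumes the word forms for each lemma have already been obtained somehow; the forms_by_lemma mapping and the highlight function are hypothetical, and the forms in the usage line are supplied by hand rather than looked up in BÍN.

```python
import re
from typing import Dict, Iterable, Set


def highlight(
    text: str, query_lemmas: Iterable[str], forms_by_lemma: Dict[str, Set[str]]
) -> str:
    """Wrap every word form belonging to a query lemma in <mark> tags."""
    forms: Set[str] = set()
    for lemma in query_lemmas:
        # Fall back to the lemma itself if no forms are known for it
        forms |= forms_by_lemma.get(lemma, {lemma})
    if not forms:
        return text
    # Longest alternatives first, so longer forms are tried before shorter ones
    alternatives = "|".join(
        re.escape(form) for form in sorted(forms, key=len, reverse=True)
    )
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    return pattern.sub(r"<mark>\1</mark>", text)


# Hand-written forms, for illustration only
forms = {"sækja": {"sækja", "sótti", "sótt"}}
print(highlight("Sótt er um leyfi", ["sækja"], forms))
```

The missing piece, and the actual request here, is a library call that produces the set of forms for a lemma, including verb forms such as the past participle.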
Tokenizing the phrase 'hver er hæð sólar í dag í reykjavík' with auto_uppercase=True results in 'Sólar Í Dag Í Reykjavík' being interpreted as a single name token (TOK.PERSON).
This also happens when 'í reykjavík' is omitted (resulting in 'Sólar Í Dag').
('hæð' probably shouldn't be capitalized either in this case).
BÍN has POS tags such as present and past tense (NT, ÞT), and attached definite article (gr), which are not always returned in Reynir's variant lists since they are not significant for the parse as such. However this may well be useful information for Reynir clients. An augmentation feature should be added to Reynir that adds any significant missing BÍN tags to terminal variants before they are returned from Reynir, for instance in the _Sentence.terminals property.