mideind / greynircorrect
Spelling and grammar correction for Icelandic
License: Other
I want to use GreynirCorrect to correct incomplete sentences, in extreme cases single words. What method or options should I use to make that possible?
Currently, when I use the tokenize() method with the option only_ci=True, it complains about the following:
Maðurin Z002 Orð á að byrja á hástaf: 'maðurin'
Maðurinn Z002 Orð á að byrja á hástaf: 'maðurinn'
Sample code:
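from reynir_correct import tokenize

texts = ["maðurin", "maðurinn"]
for text in texts:
    # only_ci=True is the option under discussion
    for tok in tokenize(text, only_ci=True):
        # Skip tokens without text, such as sentence markers
        if tok.txt:
            print(f"{tok.txt:12} {tok.error_code:8} {tok.error_description}")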
Hi,
I was wondering in which cases this list is not empty:
GreynirCorrect/src/reynir_correct/main.py
Line 191 in deec51e
I am actually working with this without using the gen function; I just work with the text directly as a string.
I also printed TOK.END and got frozenset({10000, 12001, 10002, 11002}). Which characters does this refer to?
If I run, for example, print(chr(10000)), I just get very strange symbols.
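The integers in that frozenset look like token-kind identifiers rather than character code points, which would explain why chr(10000) prints an unrelated symbol. Below is a minimal sketch of that reading. It does not import the library: the mapping is a hypothetical local mirror, and the names S_SPLIT, P_END, S_END and X_END are an assumption recalled from the Tokenizer package's TOK class, so they should be checked against the installed version.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical local mirror of token-kind constants (assumed names, not
# imported from the library; verify against the installed tokenizer package).
KIND_NAMES = {
    10000: "S_SPLIT",  # assumed: sentence split marker
    10002: "P_END",    # assumed: paragraph end
    11002: "S_END",    # assumed: sentence end
    12001: "X_END",    # assumed: end-of-stream sentinel
}

# The set printed in the question, rebuilt from the mapping above.
END: FrozenSet[int] = frozenset(KIND_NAMES)


def is_end_kind(kind: int) -> bool:
    """Return True if a token kind belongs to the END set."""
    return kind in END


def describe(kinds: Iterable[int]) -> List[str]:
    """Translate numeric kinds into readable names, in ascending order."""
    return [KIND_NAMES.get(k, f"UNKNOWN({k})") for k in sorted(kinds)]


names = describe(END)
```

Under this reading, the set is meant for membership tests such as is_end_kind(tok.kind), not for conversion with chr().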
I have an open issue in Yfirlestur, but it is probably more appropriate for GreynirCorrect, so I am creating it here for visibility.
Issue in Yfirlestur
Take the text Hann vil, for example. GreynirCorrect gives two suggestions, and the latter one is the same as the original input. What appears to be happening is that the latter suggestion is based on the first suggestion as its input, rather than on the original input Hann vil.
As a consequence I get this cyclical suggestion: Hann vil -> Hann vill -> Hann vil ... There is no resolution for the word vil / vill.
Response given by Yfirlestur for text Hann vil
{
  "result": [
    [
      {
        "annotations": [
          {
            "code": "P_wrong_person",
            "detail": null,
            "end": 1,
            "end_char": 7,
            "references": [],
            "start": 0,
            "start_char": 0,
            "suggest": "Hann vill",
            "suggestlist": null,
            "text": "Orðasambandið 'Hann vil' var leiðrétt í 'Hann vill'"
          },
          {
            "code": "BEYGVILLA",
            "detail": "Beygingarmyndin 'vill' er ekki í samræmi við málvenju, 'vil' er ákjósanlegra.",
            "end": 1,
            "end_char": 7,
            "references": [],
            "start": 1,
            "start_char": 4,
            "suggest": "vil",
            "suggestlist": null,
            "text": "Beygingarvilla: 'vill' -> 'vil'"
          }
        ],
        "corrected": "Hann vil",
        "nonce": "41903140",
        "original": "Hann vil",
        "token": "458f66a39f679f710e313e3d1e456e0971abd7405453b32543e47048d4351b2d",
        "tokens": [
          {
            "i": 0,
            "k": 6,
            "o": "Hann",
            "x": "Hann"
          },
          {
            "i": 4,
            "k": 6,
            "o": " vil",
            "x": "vil"
          }
        ]
      }
    ]
  ],
  "stats": {
    "ambiguity": 1.0,
    "num_chars": 8,
    "num_parsed": 1,
    "num_sentences": 1,
    "num_tokens": 2
  },
  "text": "Hann vil",
  "valid": true
}
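One way to spot the problem on the client side is to flag annotations whose suggestion equals the text they would replace in the original input. The sketch below works on a trimmed copy of the response above. It assumes that start_char and end_char index into "original" and that end_char is inclusive, which is inferred only from this sample (the first annotation spans 0 to 7 for the 8-character input); noop_annotations is a hypothetical helper, not part of Yfirlestur or GreynirCorrect.

```python
from typing import Any, Dict, List

# Trimmed copy of the sentence object from the response for "Hann vil".
sentence: Dict[str, Any] = {
    "original": "Hann vil",
    "annotations": [
        {"code": "P_wrong_person", "start_char": 0, "end_char": 7,
         "suggest": "Hann vill"},
        {"code": "BEYGVILLA", "start_char": 4, "end_char": 7,
         "suggest": "vil"},
    ],
}


def noop_annotations(sent: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return annotations whose suggestion matches the original span.

    Assumption: end_char is inclusive and both offsets refer to the
    "original" string. Surrounding whitespace is ignored because the
    span of the second token includes its leading space.
    """
    original = sent["original"]
    flagged = []
    for ann in sent["annotations"]:
        suggest = ann.get("suggest")
        if suggest is None:
            continue
        span = original[ann["start_char"]:ann["end_char"] + 1]
        if span.strip() == suggest.strip():
            flagged.append(ann)
    return flagged


flagged_codes = [ann["code"] for ann in noop_annotations(sentence)]
```

For this sample only the BEYGVILLA annotation is flagged: it proposes vil for a span that already reads vil in the original input.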
If I apply the first suggestion, Hann vill, and call the service again with my new string Hann vill, I get this suggestion (basically the latter suggestion again).
Response given by Yfirlestur for text Hann vill
{
  "result": [
    [
      {
        "annotations": [
          {
            "code": "BEYGVILLA",
            "detail": "Beygingarmyndin 'vill' er ekki í samræmi við málvenju, 'vil' er ákjósanlegra.",
            "end": 1,
            "end_char": 8,
            "references": [],
            "start": 1,
            "start_char": 4,
            "suggest": "vil",
            "suggestlist": null,
            "text": "Beygingarvilla: 'vill' -> 'vil'"
          }
        ],
        "corrected": "Hann vil",
        "nonce": "28078813",
        "original": "Hann vill",
        "token": "8d2b53caad5b029b1064172be9ca776a6c0b7b539af3e6b668973c937433ea7c",
        "tokens": [
          {
            "i": 0,
            "k": 6,
            "o": "Hann",
            "x": "Hann"
          },
          {
            "i": 4,
            "k": 6,
            "o": " vill",
            "x": "vil"
          }
        ]
      }
    ]
  ],
  "stats": {
    "ambiguity": 1.0,
    "num_chars": 9,
    "num_parsed": 1,
    "num_sentences": 1,
    "num_tokens": 2
  },
  "text": "Hann vill",
  "valid": true
}
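The loop described in this issue can be reproduced without the service by applying suggestions repeatedly and stopping as soon as a text repeats. This is a rough sketch: follow_suggestions and has_cycle are hypothetical helpers, and stub_correct is a stand-in that only replays the two suggestions reported above. A real client would call Yfirlestur or GreynirCorrect in its place.

```python
from typing import Callable, List


def follow_suggestions(text: str,
                       correct: Callable[[str], str],
                       max_steps: int = 10) -> List[str]:
    """Apply `correct` repeatedly and record every text seen.

    Stops at a fixed point (nothing changes), when a text repeats
    (a cycle), or after max_steps applications.
    """
    chain = [text]
    for _ in range(max_steps):
        text = correct(text)
        if text == chain[-1]:
            break  # fixed point: no further suggestion
        repeated = text in chain
        chain.append(text)
        if repeated:
            break  # cycle: we are back at an earlier text
    return chain


def has_cycle(chain: List[str]) -> bool:
    """True if the last text in the chain already occurred earlier."""
    return len(chain) > 1 and chain[-1] in chain[:-1]


def stub_correct(text: str) -> str:
    """Stand-in for the service: replays the suggestions from this issue."""
    replay = {"Hann vil": "Hann vill", "Hann vill": "Hann vil"}
    return replay.get(text, text)


chain = follow_suggestions("Hann vil", stub_correct)
```

With the stub, the chain is Hann vil, Hann vill, Hann vil, and has_cycle(chain) is true, which matches the behaviour reported here.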
This is currently possible by passing "Z002" to ignore_rules, but not without losing more functionality.
It would be useful to have different error codes for words that should start with an uppercase letter because they are the first word of a sentence and for words that should always be capitalized, e.g. Reykjavík.
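Until separate codes exist, a client-side workaround might be to clear Z002 only on the first word of the input and keep it everywhere else. The following is a minimal sketch under stated assumptions: Tok is a hypothetical stand-in with just the txt and error_code fields used in the sample code of the earlier issue, the input is a single sentence, and mid-sentence proper nouns are reported with Z002 as this request implies. The example tokens are made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for a corrected token."""
    txt: str
    error_code: str


def clear_sentence_initial_z002(tokens: Iterable[Tok]) -> List[Tok]:
    """Drop Z002 from the first token that has text; keep all others."""
    result = []
    seen_word = False
    for tok in tokens:
        if tok.txt and not seen_word:
            seen_word = True
            if tok.error_code == "Z002":
                tok = tok._replace(error_code="")
        result.append(tok)
    return result


# Made-up token stream: a marker without text, then four words.
tokens = [
    Tok("", ""),
    Tok("Maðurin", "Z002"),
    Tok("býr", ""),
    Tok("í", ""),
    Tok("Reykjavík", "Z002"),
]
filtered = clear_sentence_initial_z002(tokens)
codes = [tok.error_code for tok in filtered]
```

Unlike passing "Z002" to ignore_rules, this keeps the capitalization error on Reykjavík while dropping the one caused by the sentence-initial position.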
If there are multiple ways to correct a word, e.g. several candidates at the same Levenshtein distance, is it possible to get back all the corrections GreynirCorrect knows? If so, is it also possible to get back a ranked set of corrections?
For example:
billinn er a kassanum
could be corrected to "billinn er á kassanum" or "billinn er í kassanum", with a higher probability for the former. But in terms of Levenshtein distance, both are at a distance of 1.
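To make the tie concrete, here is a small self-contained sketch of the Levenshtein computation together with a tie-breaking rank. It does not use GreynirCorrect: rank_candidates is a hypothetical helper, and the scores are placeholder numbers chosen only to illustrate the idea, not real frequencies or probabilities.

```python
from typing import Callable, Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rank_candidates(word: str,
                    candidates: Iterable[str],
                    score: Callable[[str], float]) -> List[str]:
    """Sort by edit distance first, then by descending score to break ties."""
    return sorted(candidates,
                  key=lambda c: (levenshtein(word, c), -score(c)))


# Placeholder scores, made up for illustration only.
placeholder_scores = {"á": 2.0, "í": 1.0}

dist_a = levenshtein("a", "á")
dist_i = levenshtein("a", "í")
ranked = rank_candidates("a", ["í", "á"],
                         lambda c: placeholder_scores.get(c, 0.0))
```

Both candidates are one substitution away from a, so the distance alone cannot order them. Any ranking has to come from a second signal, such as word frequency or sentence context.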
The following snippet gives inconsistent results:
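from reynir_correct import tokenize

texts = ["Skúta", "300 ára gömul írsk skúta fundin við Suður-Noreg"]
for text in texts:
    # only_ci=True is the option that triggers the inconsistency
    for tok in tokenize(text, only_ci=True):
        # Skip tokens without text, such as sentence markers
        if tok.txt:
            print(f"{tok.txt:12} {tok.error_code:8} {tok.error_description}")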
Output:
Skúta
300
ára
gömul
írsk
skúta U001 Óþekkt orð: 'skúta'
fundin
við
Suður-Noreg
The correct word skúta is marked as unknown inside the sentence, but not when it is written as a standalone word. Calling the tokenize() method without any options works as expected.
It is also not clear from the documentation what exactly the option only_ci does.
Could you add a link in the README to the networking API of GreynirCorrect (https://github.com/mideind/Yfirlestur)?
The README says: "GreynirCorrect is documented in detail here."
The link given in README.rst is dead.