mideind / greynircorrect

Spelling and grammar correction for Icelandic

License: Other

Python 100.00%

nlp icelandic grammar spelling correction python python3 language reynir greynir

greynircorrect's People

thorunna grammatek g-thor

greynircorrect's Issues

Single word / part of sentence correction

I want to use GreynirCorrect to correct text that is not a whole sentence, in the extreme case a single word. Which function or options should I use to make that possible?

Currently, when using the tokenize() function with the option only_ci=True, it reports the following:

Maðurin      Z002     Orð á að byrja á hástaf: 'maðurin'
Maðurinn     Z002     Orð á að byrja á hástaf: 'maðurinn'

Sample code:

from reynir_correct import tokenize

texts = ["maðurin", "maðurinn"]

for text in texts:
    # Tokenize each input with the only_ci option enabled
    for tok in tokenize(text, only_ci=True):
        # Skip tokens that carry no text
        if tok.txt:
            print(f"{tok.txt:12} {tok.error_code:8} {tok.error_description}")

Tok definitions

Hi,
I was wondering in which cases this list is not empty:

GreynirCorrect/src/reynir_correct/main.py

Line 191 in deec51e

if curr_sent:

I am actually working with this without using the gen function, operating directly on the text as a string instead.

I also printed TOK.END and got frozenset({10000, 12001, 10002, 11002}). Which characters does this refer to?
If I run, for example, print(chr(10000)), I just get very strange symbols.
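
One plausible reading is that the members of TOK.END are integer token-kind identifiers rather than character codes, which would explain why chr(10000) prints an unrelated symbol. The sketch below illustrates that reading only. The names attached to each number are assumptions about the tokenizer package's constants and are not confirmed anywhere in this thread.

```python
# Assumed mapping from token-kind number to name (hypothetical, for illustration).
TOKEN_KINDS = {
    10000: "S_SPLIT",  # assumed: sentence split marker
    10002: "P_END",    # assumed: paragraph end
    11002: "S_END",    # assumed: sentence end
    12001: "X_END",    # assumed: end-of-stream sentinel
}

# The set printed in the issue, rebuilt from the assumed mapping.
END = frozenset(TOKEN_KINDS)


def is_end_token(kind: int) -> bool:
    """Return True if a token kind belongs to the END set."""
    return kind in END


# Membership is tested against a token's kind, never against a character.
end_names = sorted(TOKEN_KINDS[kind] for kind in END)
print(end_names)
print(is_end_token(11002), is_end_token(ord("a")))
```

Under this reading, the set is meant for checks of the form `token.kind in TOK.END`, so passing its members to chr() is not meaningful.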

Cyclical suggestion

I have an open issue in Yfirlestur, but it is probably more appropriate for GreynirCorrect, so I am creating it here for visibility.
Issue in Yfirlestur

Take the text Hann vil as an example. GreynirCorrect gives two suggestions, and the second one is identical to the original input. What appears to be happening is that the second suggestion is computed from the output of the first suggestion instead of from the original input Hann vil.
As a consequence I get a cyclical suggestion: Hann vil -> Hann vill -> Hann vil... The word vil / vill is never resolved.

Response given by Yfirlestur for text Hann vil

{
    "result": [
        [
            {
                "annotations": [
                    {
                        "code": "P_wrong_person",
                        "detail": null,
                        "end": 1,
                        "end_char": 7,
                        "references": [],
                        "start": 0,
                        "start_char": 0,
                        "suggest": "Hann vill",
                        "suggestlist": null,
                        "text": "Orðasambandið 'Hann vil' var leiðrétt í 'Hann vill'"
                    },
                    {
                        "code": "BEYGVILLA",
                        "detail": "Beygingarmyndin 'vill' er ekki í samræmi við málvenju, 'vil' er ákjósanlegra.",
                        "end": 1,
                        "end_char": 7,
                        "references": [],
                        "start": 1,
                        "start_char": 4,
                        "suggest": "vil",
                        "suggestlist": null,
                        "text": "Beygingarvilla: 'vill' -> 'vil'"
                    }
                ],
                "corrected": "Hann vil",
                "nonce": "41903140",
                "original": "Hann vil",
                "token": "458f66a39f679f710e313e3d1e456e0971abd7405453b32543e47048d4351b2d",
                "tokens": [
                    {
                        "i": 0,
                        "k": 6,
                        "o": "Hann",
                        "x": "Hann"
                    },
                    {
                        "i": 4,
                        "k": 6,
                        "o": " vil",
                        "x": "vil"
                    }
                ]
            }
        ]
    ],
    "stats": {
        "ambiguity": 1.0,
        "num_chars": 8,
        "num_parsed": 1,
        "num_sentences": 1,
        "num_tokens": 2
    },
    "text": "Hann vil",
    "valid": true
}
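
A client-side check for this case can be sketched as follows: flag every annotation whose suggestion is the same as the original text it covers. This is a workaround sketch, not how Yfirlestur or GreynirCorrect handle it. It assumes that start_char and end_char are inclusive character offsets into "original", which is inferred only from the sample response above, and the helper name noop_annotations is hypothetical.

```python
def noop_annotations(sentence):
    """Return annotations whose suggestion equals the text they would replace."""
    original = sentence["original"]
    flagged = []
    for ann in sentence["annotations"]:
        # Assumption: both offsets are inclusive, so add 1 to the end for slicing.
        span = original[ann["start_char"] : ann["end_char"] + 1]
        suggestion = ann["suggest"]
        # The covered span may include a leading space, so compare stripped text.
        if suggestion is not None and suggestion.strip() == span.strip():
            flagged.append(ann)
    return flagged


# Trimmed copy of the sentence object from the response above.
sentence = {
    "original": "Hann vil",
    "annotations": [
        {"code": "P_wrong_person", "start_char": 0, "end_char": 7, "suggest": "Hann vill"},
        {"code": "BEYGVILLA", "start_char": 4, "end_char": 7, "suggest": "vil"},
    ],
}

codes = [ann["code"] for ann in noop_annotations(sentence)]
print(codes)
```

For the response above this flags only the BEYGVILLA annotation, since its suggestion vil matches the original word.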

If I accept the first suggestion, Hann vill, and call the service again with the new string Hann vill, I get the following suggestion (essentially the second suggestion again).

Response given by Yfirlestur for text Hann vill

{
    "result": [
        [
            {
                "annotations": [
                    {
                        "code": "BEYGVILLA",
                        "detail": "Beygingarmyndin 'vill' er ekki í samræmi við málvenju, 'vil' er ákjósanlegra.",
                        "end": 1,
                        "end_char": 8,
                        "references": [],
                        "start": 1,
                        "start_char": 4,
                        "suggest": "vil",
                        "suggestlist": null,
                        "text": "Beygingarvilla: 'vill' -> 'vil'"
                    }
                ],
                "corrected": "Hann vil",
                "nonce": "28078813",
                "original": "Hann vill",
                "token": "8d2b53caad5b029b1064172be9ca776a6c0b7b539af3e6b668973c937433ea7c",
                "tokens": [
                    {
                        "i": 0,
                        "k": 6,
                        "o": "Hann",
                        "x": "Hann"
                    },
                    {
                        "i": 4,
                        "k": 6,
                        "o": " vill",
                        "x": "vil"
                    }
                ]
            }
        ]
    ],
    "stats": {
        "ambiguity": 1.0,
        "num_chars": 9,
        "num_parsed": 1,
        "num_sentences": 1,
        "num_tokens": 2
    },
    "text": "Hann vill",
    "valid": true
}
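
The loop described in this issue can be detected mechanically by feeding each accepted suggestion back in and stopping as soon as a text repeats. The sketch below uses a stub in place of the real service: the dictionary maps each input to the suggestion a user would accept according to the two responses above. The names find_cycle and accepted are hypothetical.

```python
def find_cycle(text, correct, max_rounds=10):
    """Apply `correct` repeatedly; return the looping texts, or None if it settles."""
    seen = [text]
    for _ in range(max_rounds):
        text = correct(text)
        if text == seen[-1]:
            return None  # Fixed point: nothing left to suggest.
        if text in seen:
            # A previously seen text came back, so report the loop.
            return seen[seen.index(text):] + [text]
        seen.append(text)
    return None  # No repetition found within max_rounds.


# Stub for the service, built from the two responses in this issue.
accepted = {"Hann vil": "Hann vill", "Hann vill": "Hann vil"}

cycle = find_cycle("Hann vil", lambda s: accepted.get(s, s))
print(cycle)
```

A client could use such a check to stop offering further suggestions for a span once a cycle is found.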

Request: error code to ignore "sentence case capitalization"

Sentence-case capitalization errors can currently be ignored by passing "Z002" to ignore_rules, but doing so also disables other capitalization checks that share the same code.

It would be useful to have separate error codes for words that should start with an uppercase letter because they are the first word of a sentence, and for words that should always be capitalized, e.g. Reykjavík.
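
Until such a split exists, one rough workaround is to post-filter the results and drop Z002 only on the first token of each sentence. The sketch below works on a hypothetical list of (text, error_code) pairs for one sentence, not on the library's token objects, and the function name is made up for illustration. Its limitation is the one this request is about: a proper noun written in lowercase at the start of a sentence is suppressed too.

```python
def drop_sentence_initial_z002(tokens):
    """Clear the Z002 code on the first token only; keep all other codes."""
    kept = []
    for index, (text, code) in enumerate(tokens):
        if index == 0 and code == "Z002":
            code = ""  # Treat as sentence-case capitalization and ignore it.
        kept.append((text, code))
    return kept


# Hypothetical annotated sentence: Z002 on the first word and on a proper noun.
sentence = [("maðurinn", "Z002"), ("býr", ""), ("í", ""), ("reykjavík", "Z002")]

filtered = drop_sentence_initial_z002(sentence)
print(filtered)
```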

Q: Is it possible to get back ranked corrections?

If there are multiple ways to correct a word, e.g. several candidates at the same Levenshtein distance, is it possible to get back all the corrections GreynirCorrect knows of? If so, is it also possible to get them back as a ranked list?

For example:

billinn er a kassanum

This could be corrected to billinn er á kassanum or billinn er í kassanum, with the former being more probable. In terms of Levenshtein distance, however, both are at distance 1 from the input.
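
The tie can be checked with a plain Levenshtein implementation. The sketch below is a standard dynamic-programming edit distance written for this example and is not GreynirCorrect's own ranking code. It shows why edit distance alone cannot order the two candidates, so a ranked list would need an additional signal such as word or context probability.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep it
            ))
        prev = curr
    return prev[-1]


source = "billinn er a kassanum"
candidates = ["billinn er á kassanum", "billinn er í kassanum"]

distances = {cand: levenshtein(source, cand) for cand in candidates}
print(distances)
```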

Strange behavior of tokenize(.., only_ci=True)

The following snippet gives inconsistent results:

from reynir_correct import tokenize

texts = ["Skúta", "300 ára gömul írsk skúta fundin við Suður-Noreg"]

for text in texts:
    # Tokenize each input with the only_ci option enabled
    for tok in tokenize(text, only_ci=True):
        # Skip tokens that carry no text
        if tok.txt:
            print(f"{tok.txt:12} {tok.error_code:8} {tok.error_description}")

Output:

Skúta                 
300                   
ára                   
gömul                 
írsk                  
skúta        U001     Óþekkt orð: 'skúta'
fundin                
við                   
Suður-Noreg

The correctly spelled word skúta is marked as unknown inside the sentence, but not when it is passed as a standalone word. Calling tokenize() without any options works as expected.

It is also not clear from the documentation what exactly the option only_ci does.

[Documentation] Link to Yfirlestur?

Could you add a link in the README to Yfirlestur, the network API for GreynirCorrect (https://github.com/mideind/Yfirlestur)?

Dead link in README.rst

GreynirCorrect is documented in detail here.

The given link in README.rst is dead.
