greynircorrect's People

Contributors

g-thor, haukurpall, holado, peturorri, svanhvitlilja, sveinbjornt, thorunna, vthorsteinsson

greynircorrect's Issues

Single word / part of sentence correction

I want to use Greynir-Correct to correct sentence fragments rather than whole sentences, in the extreme case single words. Which method or options should I use to make that possible?

Currently, when using the tokenize() method with option only_ci=True, it complains about the following:

Maðurin      Z002     Orð á að byrja á hástaf: 'maðurin'
Maðurinn     Z002     Orð á að byrja á hástaf: 'maðurinn'

Sample code:

from reynir_correct import tokenize

texts = ["maðurin", "maðurinn"]

for text in texts:
    # Use distinct names so the inner loop does not shadow the outer variable
    for tok in tokenize(text, only_ci=True):
        if tok.txt:
            print(f"{tok.txt:12} {tok.error_code:8} {tok.error_description}")

Tok definitions

Hi,
I was wondering in which cases this list is not empty:

I am actually working with this without using the gen function, and instead just work with the text directly as a string.

I was also printing out TOK.END and got frozenset({10000, 12001, 10002, 11002}). Which characters does this refer to?
If I run e.g. print(chr(10000)), I just get very strange symbols.
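
The integers in TOK.END look like token kind codes rather than Unicode code points, which would explain why chr(10000) prints an unrelated symbol. A minimal sketch of that reading, assuming the names below match the constants in the tokenizer package (they are written from memory and should be checked against the installed version):

```python
# Assumed mapping of token kind codes to constant names; verify against
# the installed tokenizer package before relying on it.
ASSUMED_KIND_NAMES = {
    10000: "S_SPLIT",  # sentence split marker
    10002: "P_END",    # paragraph end
    11002: "S_END",    # sentence end
    12001: "X_END",    # end sentinel
}

# The value printed for TOK.END in the question above.
tok_end = frozenset({10000, 12001, 10002, 11002})


def describe_kinds(kinds):
    """Return sorted (code, name) pairs for a set of token kind codes."""
    return [(k, ASSUMED_KIND_NAMES.get(k, "UNKNOWN")) for k in sorted(kinds)]


for code, name in describe_kinds(tok_end):
    print(code, name)
```

Under that reading, the set is meant for membership tests on a token's kind (something like tok.kind in TOK.END), not for conversion with chr().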

Cyclical suggestion

I have an open issue in Yfirlestur, but it is probably more appropriate for GreynirCorrect, so I am creating it here for visibility.
Issue in Yfirlestur

Take the text Hann vil as an example. GreynirCorrect gives two suggestions, the second of which is identical to the original input. What appears to be happening is that the second suggestion is computed from the first suggestion rather than from the original input Hann vil.
As a consequence I get this cyclical suggestion: Hann vil -> Hann vill -> Hann vil... The word vil / vill is never resolved.

Response given by Yfirlestur for text Hann vil

{
    "result": [
        [
            {
                "annotations": [
                    {
                        "code": "P_wrong_person",
                        "detail": null,
                        "end": 1,
                        "end_char": 7,
                        "references": [],
                        "start": 0,
                        "start_char": 0,
                        "suggest": "Hann vill",
                        "suggestlist": null,
                        "text": "Orðasambandið 'Hann vil' var leiðrétt í 'Hann vill'"
                    },
                    {
                        "code": "BEYGVILLA",
                        "detail": "Beygingarmyndin 'vill' er ekki í samræmi við málvenju, 'vil' er ákjósanlegra.",
                        "end": 1,
                        "end_char": 7,
                        "references": [],
                        "start": 1,
                        "start_char": 4,
                        "suggest": "vil",
                        "suggestlist": null,
                        "text": "Beygingarvilla: 'vill' -> 'vil'"
                    }
                ],
                "corrected": "Hann vil",
                "nonce": "41903140",
                "original": "Hann vil",
                "token": "458f66a39f679f710e313e3d1e456e0971abd7405453b32543e47048d4351b2d",
                "tokens": [
                    {
                        "i": 0,
                        "k": 6,
                        "o": "Hann",
                        "x": "Hann"
                    },
                    {
                        "i": 4,
                        "k": 6,
                        "o": " vil",
                        "x": "vil"
                    }
                ]
            }
        ]
    ],
    "stats": {
        "ambiguity": 1.0,
        "num_chars": 8,
        "num_parsed": 1,
        "num_sentences": 1,
        "num_tokens": 2
    },
    "text": "Hann vil",
    "valid": true
}

If I accept the first suggestion, Hann vill, and call the service again with that string, I get the following suggestion (essentially the second suggestion from the previous response).

Response given by Yfirlestur for text Hann vill

{
    "result": [
        [
            {
                "annotations": [
                    {
                        "code": "BEYGVILLA",
                        "detail": "Beygingarmyndin 'vill' er ekki í samræmi við málvenju, 'vil' er ákjósanlegra.",
                        "end": 1,
                        "end_char": 8,
                        "references": [],
                        "start": 1,
                        "start_char": 4,
                        "suggest": "vil",
                        "suggestlist": null,
                        "text": "Beygingarvilla: 'vill' -> 'vil'"
                    }
                ],
                "corrected": "Hann vil",
                "nonce": "28078813",
                "original": "Hann vill",
                "token": "8d2b53caad5b029b1064172be9ca776a6c0b7b539af3e6b668973c937433ea7c",
                "tokens": [
                    {
                        "i": 0,
                        "k": 6,
                        "o": "Hann",
                        "x": "Hann"
                    },
                    {
                        "i": 4,
                        "k": 6,
                        "o": " vill",
                        "x": "vil"
                    }
                ]
            }
        ]
    ],
    "stats": {
        "ambiguity": 1.0,
        "num_chars": 9,
        "num_parsed": 1,
        "num_sentences": 1,
        "num_tokens": 2
    },
    "text": "Hann vill",
    "valid": true
}
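
On the client side, the loop described above can be detected by following suggestions until a text repeats. The sketch below is only an illustration: find_cycle and stub_suggest are hypothetical helpers, and the stub dictionary stands in for the real service, mirroring the two responses shown above.

```python
from typing import Callable, List, Optional


def find_cycle(
    text: str,
    suggest: Callable[[str], Optional[str]],
    max_steps: int = 10,
) -> Optional[List[str]]:
    """Follow suggestions from `text`; return the looping path if a text repeats."""
    seen = [text]
    current = text
    for _ in range(max_steps):
        nxt = suggest(current)
        if nxt is None or nxt == current:
            return None  # converged: nothing further to change
        if nxt in seen:
            # Return the part of the path that forms the loop
            return seen[seen.index(nxt):] + [nxt]
        seen.append(nxt)
        current = nxt
    return None  # no repeat found within max_steps


# Hypothetical stand-in for the correction service, based on the
# two responses quoted in this issue.
STUB = {"Hann vil": "Hann vill", "Hann vill": "Hann vil"}


def stub_suggest(text: str) -> Optional[str]:
    return STUB.get(text)


print(find_cycle("Hann vil", stub_suggest))
```

A caller could use such a check to stop offering a suggestion that leads straight back to the text the user started with.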

Request: error code to ignore "sentence case capitalization"

This is currently possible by passing "Z002" to ignore_rules, but that also suppresses capitalization errors that I still want reported.

It would be useful to have different error codes for words that should start with an uppercase letter because they are the first word of a sentence and for words that should always be capitalized, e.g. Reykjavík.
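
Until separate codes exist, one possible workaround is to filter the results afterwards instead of ignoring Z002 globally. The sketch below assumes annotations shaped like the dictionaries in the Yfirlestur responses (with "code" and a token index in "start") and assumes that sentence-case errors are reported on the first token; drop_sentence_case and the sample data are hypothetical.

```python
from typing import Any, Dict, List

Annotation = Dict[str, Any]


def drop_sentence_case(annotations: List[Annotation]) -> List[Annotation]:
    """Drop Z002 annotations on the first token; keep everything else.

    Assumption: a Z002 annotation with start == 0 is a sentence-case
    error, while Z002 elsewhere concerns a word that is always capitalized.
    """
    return [
        ann
        for ann in annotations
        if not (ann.get("code") == "Z002" and ann.get("start") == 0)
    ]


# Made-up sample data for illustration only
sample = [
    {"code": "Z002", "start": 0, "suggest": "Maðurinn"},
    {"code": "Z002", "start": 3, "suggest": "Reykjavík"},
    {"code": "U001", "start": 2, "suggest": None},
]

for ann in drop_sentence_case(sample):
    print(ann["code"], ann["start"], ann["suggest"])
```

The obvious limitation is that a proper noun in first position is dropped as well, which is exactly why distinct error codes would be the cleaner solution.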

Q: Is it possible to get back ranked corrections?

If there are multiple ways to correct a word, e.g. several candidates at the same Levenshtein distance, is it possible to get back all the corrections greynir-correct knows? If so, is it also possible to get them back as a ranked set?

For example:

billinn er a kassanum

This could be corrected to billinn er á kassanum or billinn er í kassanum, with the former being more probable. But in terms of Levenshtein distance, both are at a distance of 1.
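
To make the tie concrete, here is a small sketch that computes the edit distance for both candidates and then breaks the tie with a score table. The rank helper and the scores are invented for illustration; they are not taken from greynir-correct, which may rank candidates in an entirely different way.

```python
import unicodedata
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, compared in NFC form."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def rank(word: str, candidates: List[str], score: Dict[str, float]) -> List[str]:
    """Sort by distance first, then by descending score to break ties."""
    return sorted(
        candidates,
        key=lambda c: (levenshtein(word, c), -score.get(c, 0.0)),
    )


# Invented scores, for illustration only
ASSUMED_SCORE = {"á": 0.6, "í": 0.4}

for cand in rank("a", ["í", "á"], ASSUMED_SCORE):
    print(cand, levenshtein("a", cand))
```

Distance alone cannot separate the two candidates, so any ranking has to come from extra information such as word frequency or sentence context.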

Strange behavior of tokenize(.., only_ci=True)

The following snippet gives inconsistent results:

from reynir_correct import tokenize

texts = ["Skúta", "300 ára gömul írsk skúta fundin við Suður-Noreg"]
for text in texts:
    # Use distinct names so the inner loop does not shadow the outer variable
    for tok in tokenize(text, only_ci=True):
        if tok.txt:
            print(f"{tok.txt:12} {tok.error_code:8} {tok.error_description}")

Output:

Skúta                 
300                   
ára                   
gömul                 
írsk                  
skúta        U001     Óþekkt orð: 'skúta'
fundin                
við                   
Suður-Noreg

The correct word skúta is marked as unknown inside the sentence, but not when it is given as a standalone word. Calling tokenize() without any options works as expected.

It is also not clear from the documentation what exactly the option only_ci does.
