pospell's People

Contributors

humitos, jdillard, julienpalard, mondeja, rtobar, seluj78, xi

pospell's Issues

Bug: Words separated by a hyphen with surrounding spaces are joined and shown as invalid

Steps to reproduce.

  1. Create a valid .po file with the following content:
first - word
  2. Run pospell my_previous_file.po
  3. Check that it reports firstword as an invalid word, because it strips the hyphen along with the surrounding spaces.

The problem seems to be this regex, https://github.com/JulienPalard/pospell/blob/master/pospell.py#L136, which is too broad. Perhaps we should add a flag that enables a more relaxed replacement policy?
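
A minimal sketch of the suspected behaviour, assuming (not verified against the linked line) that the cleanup step deletes a hyphen together with its surrounding whitespace. Both patterns and both function names are hypothetical and only illustrate how a broad rule differs from a narrower one:

```python
import re


def join_broad(text):
    # Hypothetical over-broad rule: remove every hyphen and any whitespace around it.
    return re.sub(r"\s*-\s*", "", text)


def join_narrow(text):
    # Hypothetical narrower rule: remove a hyphen only when it sits
    # directly between two word characters.
    return re.sub(r"(?<=\w)-(?=\w)", "", text)


print(join_broad("first - word"))
print(join_narrow("first - word"))
```

With the narrower rule, a hyphen used as a dash between two separate words is left alone, so the two words are still checked individually.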

Compound words not correctly handled by pospell

Consider this simple *.po file:

#
msgid ""
msgstr ""

msgid "pub/sub"
msgstr "pub/sub"

Let's try to generate an error for pub/sub with hunspell, using the es_ES dictionary:

$ echo 'pub/sub' | hunspell -d es_ES -l
pub
sub

If you run it, you can see that the words pub and sub are both reported as incorrect, each on its own line.

If I try the same with pospell, I get the following result:

$ pospell --language es_ES prueba.po

... nothing is marked as incorrect.

Problem

Printing the result of the subprocess.run call to hunspell, I see: CompletedProcess(args=['hunspell', '-d', 'es_ES', '-l'], returncode=0, stdout='pub\nsub\n'). So the stdout is pub\nsub\n.

After the call to hunspell, the source code does this:

line_of_words = defaultdict(set)
for line, text in enumerate(text_for_hunspell.split("\n"), start=1):
    for word in text.split():
        line_of_words[word].add(line)
for misspelled_word in set(output.stdout.split("\n")):
    for line_number in line_of_words[misspelled_word]:
        errors.append((po_file, line_number, misspelled_word))

So the code that looks up the line numbers assumes that hunspell doesn't split words when it finds characters such as / or -. Adding a few prints to the source code makes this easy to see:

print(line_of_words)
for misspelled_word in set(output.stdout.split("\n")):
    if misspelled_word not in line_of_words:
        print("---> mispelled word '%s' doesn't exists in line_of_words" % misspelled_word)
    for line_number in line_of_words[misspelled_word]:
        errors.append((po_file, line_number, misspelled_word))

The complete output is:

CompletedProcess(args=['hunspell', '-d', 'es_ES', '-l'], returncode=0, stdout='pub\nsub\n')
defaultdict(<class 'set'>, {'pub/sub': {6}})
---> mispelled word '' doesn't exists in line_of_words
---> mispelled word 'pub' doesn't exists in line_of_words
---> mispelled word 'sub' doesn't exists in line_of_words

So pub and sub are not added to the errors list, because their line numbers are not found.

Possible workaround

Compound word behaviour is related to the compounding options of hunspell and depends on the dictionaries in use. Something like this may increase true positives, but it is a poor workaround:

import re

for line, text in enumerate(text_for_hunspell.split("\n"), start=1):
    for word in re.split(r' |/|-', text):
        line_of_words[word].add(line)

The ideal solution would be to parse the .aff file of each language passed and build a set of compounding rules, so that the words are split correctly.
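
As a slightly fuller sketch of the workaround (still not the .aff-based solution), one could index each whole token and also its fragments, so that both "pub/sub" and the separately reported "pub" and "sub" map back to a line. The helper names index_words and locate are hypothetical, and the separator set "/" and "-" is an assumption taken from the example above:

```python
import re
from collections import defaultdict


def index_words(text_for_hunspell):
    """Map every token, and every fragment of it, to the lines it appears on."""
    line_of_words = defaultdict(set)
    for line, text in enumerate(text_for_hunspell.split("\n"), start=1):
        for word in text.split():
            line_of_words[word].add(line)
            # Also index the fragments, since hunspell may report them separately.
            for part in re.split(r"[/-]", word):
                if part:
                    line_of_words[part].add(line)
    return line_of_words


def locate(hunspell_stdout, line_of_words):
    """Return sorted (line_number, word) pairs for the words hunspell reported."""
    errors = []
    for misspelled in sorted(set(hunspell_stdout.split("\n"))):
        if not misspelled:
            continue  # skip the empty string left by the trailing newline
        for line_number in sorted(line_of_words.get(misspelled, ())):
            errors.append((line_number, misspelled))
    return errors


index = index_words("\n\n\n\n\npub/sub")
print(locate("pub\nsub\n", index))
```

This can still attribute a fragment to the wrong line when the same fragment also appears elsewhere as a valid standalone word, which is one reason it remains a workaround.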

Issues with soft hyphen

I sometimes use soft hyphens for long words in my translations. Example:

msgid "Inclusion/exclusion criteria"
msgstr "Ein-/Ausschluss­kriterien"

This gets reported as an error:

some/path.po:842:kriterien

Even if I add "kriterien" to my personal dictionary, the error remains.

IMHO the correct way to deal with this would be to ignore the soft hyphen. Not sure if this is an issue in pospell or the underlying spell checker.
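
A minimal sketch of the "ignore the soft hyphen" idea, assuming the text can be preprocessed before it reaches the spell checker. The helper name strip_soft_hyphens is hypothetical; U+00AD is the Unicode soft hyphen:

```python
SOFT_HYPHEN = "\u00ad"  # invisible hyphenation hint, U+00AD


def strip_soft_hyphens(text):
    # Remove soft hyphens so that the word is checked as one piece.
    return text.replace(SOFT_HYPHEN, "")


print(strip_soft_hyphens("Ein-/Ausschluss\u00adkriterien"))
```

After stripping, the checker would see "Ausschlusskriterien" as a single word instead of a trailing "kriterien".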

Should pospell check capitalized words?

I see that pospell checks capitalized words, so the python-docs-fr dictionary is filling up with first names and surnames like Farrugia, Catucci, Fredrik, Guido, Hettinger and so on.
Maybe pospell should not check capitalized names?
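
One possible shape for this, as a rough sketch only: drop capitalized words unless they start a sentence, so that ordinary sentence-initial words are still checked. The function name drop_capitalized_words and the sentence-end heuristic are assumptions, not pospell's behaviour:

```python
def drop_capitalized_words(text):
    """Drop capitalized words that do not start a sentence (likely proper nouns)."""
    kept = []
    previous = ""
    for word in text.split():
        # Heuristic: a word starts a sentence if it is first, or follows . ! ? :
        at_sentence_start = previous == "" or previous.endswith((".", "!", "?", ":"))
        if not (word[:1].isupper() and not at_sentence_start):
            kept.append(word)
        previous = word
    return " ".join(kept)


print(drop_capitalized_words("Merci à Guido et Fredrik. Ils ont aidé."))
```

A name that happens to open a sentence would still be checked under this heuristic, so it reduces the dictionary noise without removing it entirely.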

Sorry, Windows path

Traceback (most recent call last):
File "C:\Python34\Scripts\pospell-script.py", line 11, in
load_entry_point('pospell==0.0.3', 'console_scripts', 'pospell')
File "C:\Python34\lib\site-packages\pospell.py", line 41, in main
(tmpdir / po_file.name).write_text(po_to_text(str(po_file)))
AttributeError: 'WindowsPath' object has no attribute 'write_text'

Any idea?
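
For context: Path.write_text was added to pathlib in Python 3.5, and the traceback shows Python 3.4. A sketch of a fallback that avoids the method, assuming pospell only needs to write a text file there; the helper name write_text_compat is hypothetical:

```python
import tempfile
from pathlib import Path


def write_text_compat(path, content):
    """Write text with open(), which also works where Path.write_text is missing."""
    with open(str(path), "w", encoding="utf-8") as handle:
        handle.write(content)


with tempfile.TemporaryDirectory() as tmpdir:
    target = Path(tmpdir) / "example.txt"
    write_text_compat(target, "bonjour")
    with open(str(target), encoding="utf-8") as handle:
        print(handle.read())
```

Upgrading to Python 3.5 or later would make the original write_text call work unchanged.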

AttributeError on docutils 0.18

docutils 0.18 was released today and seems to break pospell:

  File "/usr/local/lib/python3.9/dist-packages/pospell.py", line 119, in visit_Text
    self.output.append(node.rawsource)
AttributeError: 'Text' object has no attribute 'rawsource'
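
A sketch of a defensive way to read the node text, assuming that falling back to the node's astext() method is acceptable when rawsource is missing. StandInText is a hypothetical stand-in so the example runs without docutils installed, and node_text is a hypothetical helper, not pospell code:

```python
class StandInText(str):
    """Hypothetical stand-in for a docutils Text node (a str with astext())."""

    def astext(self):
        return str(self)


def node_text(node):
    # Prefer rawsource when the node still has it, otherwise use astext().
    raw = getattr(node, "rawsource", None)
    return raw if raw is not None else node.astext()


print(node_text(StandInText("some text")))
```

Note that rawsource and astext() are not guaranteed to return identical strings, so the later cleanup steps may need a check after such a change.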

Doesn't spot "Partypolicularité"

On python-docs-fr:

sed -i s/Particularité/Partypolicularité/ sphinx.po
pospell -l fr -p dict sphinx.po → nothing

Specifically:

$ hunspell -d fr_FR -p dict -u3 test.txt 
$ cat test.txt 
Partypolicularité de l'implémentation.

Is there something I don't understand about -u3? I don't have the issue without it.

'Values' object has no attribute 'syntax_highlight'

I'm getting the following error when running v1.0.7 on python-docs-es:

Traceback (most recent call last):
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/docutils/statemachine.py", line 310, in next_line
    self.line = self.input_lines[self.line_offset]
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/docutils/statemachine.py", line 1156, in __getitem__
    return self.data[i]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/docutils/statemachine.py", line 233, in run
    self.next_line()
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/docutils/statemachine.py", line 313, in next_line
    raise EOFError
EOFError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mondeja/files/code/python-docs-es/venv/bin/pospell", line 8, in <module>
    sys.exit(main())
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/pospell.py", line 381, in main
    errors = spell_check(
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/pospell.py", line 296, in spell_check
    texts_for_hunspell[po_file] = po_to_text(str(po_file), drop_capitalized)
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/pospell.py", line 190, in po_to_text
    buffer.append(clear(strip_rst(entry.msgstr), drop_capitalized, po_path=po_path))
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/pospell.py", line 138, in strip_rst
    parser.parse(line, document)
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/docutils/parsers/rst/__init__.py", line 191, in parse
    self.statemachine.run(inputlines, document, inliner=self.inliner)
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/docutils/parsers/rst/states.py", line 170, in run
    results = StateMachineWS.run(self, input_lines, input_offset,
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/docutils/statemachine.py", line 248, in run
    result = state.eof(context)
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/docutils/parsers/rst/states.py", line 2712, in eof
    self.blank(None, context, None)
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/docutils/parsers/rst/states.py", line 2703, in blank
    paragraph, literalnext = self.paragraph(
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/docutils/parsers/rst/states.py", line 418, in paragraph
    textnodes, messages = self.inline_text(text, lineno)
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/docutils/parsers/rst/states.py", line 427, in inline_text
    nodes, messages = self.inliner.parse(text, lineno,
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/docutils/parsers/rst/states.py", line 646, in parse
    before, inlines, remaining, sysmessages = method(self, match,
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/docutils/parsers/rst/states.py", line 789, in interpreted_or_phrase_ref
    nodelist, messages = self.interpreted(rawsource, escaped, role,
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/docutils/parsers/rst/states.py", line 886, in interpreted
    nodes, messages2 = role_fn(role, rawsource, text, lineno, self)
  File "/home/mondeja/files/code/python-docs-es/venv/lib/python3.8/site-packages/docutils/parsers/rst/roles.py", line 335, in code_role
    inliner.document.settings.syntax_highlight)
AttributeError: 'Values' object has no attribute 'syntax_highlight'

Minimal reproducible example

msgid "Un rôle de code :code:`object().__str__`"
msgstr "A code role :code:`object().__str__`."

Adding "syntax_highlight": "none", "syntax_highlight": "short" or "syntax_highlight": "long" to docutils.frontend.Values seems to fix it, but I'm not sure about the side effects: after updating to v1.0.7 and adding this setting, I get 23751 errors in python-docs-es, against 570 with v1.0.6.

Is the change in v1.0.7 a breaking change in the preprocessing step of pospell? Thanks for your work.
