phatpiglet / autocorrect

Python 3 Spelling Corrector

License: MIT License

Python 100.00%

autocorrect's Issues

Numbers are corrected by mistake.

Hello,
I found a problem when using the spell checker. Numbers are treated as errors and the library tries to correct them; for example, instead of 50 it returns "of".
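
One possible workaround, sketched here under assumptions, is to leave any token that contains a digit untouched and only pass the remaining tokens to the corrector. `correct_word` is a hypothetical stand-in for whatever spell function is in use; it is not part of the library's API.

```python
import re

# Tokens containing any digit (e.g. "50", "3.5", "2nd") are passed through unchanged.
_HAS_DIGIT = re.compile(r"\d")


def correct_text(text, correct_word):
    """Apply correct_word to each whitespace-separated token, skipping numeric ones.

    correct_word is a hypothetical callable taking and returning a string.
    """
    corrected = []
    for token in text.split():
        if _HAS_DIGIT.search(token):
            corrected.append(token)  # keep numbers exactly as written
        else:
            corrected.append(correct_word(token))
    return " ".join(corrected)
```

Note that `text.split()` collapses runs of whitespace, so this sketch does not preserve the original spacing.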

Multiple suggestions and estimations of confidence

The spell() function returns only a single value: the corrected word. Could you please add another function that returns two values (word, confidence), where confidence is a score from 0.0 to 1.0 showing how confident the system is that the provided word was misspelled? It might also be good to provide multiple suggestions with a confidence level for each of them, for example,

> spell_scores('pinrt')
[('print', 0.9), ('pint', 0.7)]

Currently autocorrect handles the typo "pinrt" incorrectly.
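
A minimal sketch of what such a function could look like, assuming a plain list of known words is available. It uses `difflib.SequenceMatcher` from the standard library as the similarity measure, which is a swapped-in technique and not how autocorrect ranks candidates. The name `spell_scores` comes from the request above; the `vocabulary` and `cutoff` parameters are hypothetical.

```python
from difflib import SequenceMatcher


def spell_scores(word, vocabulary, cutoff=0.6):
    """Return (candidate, score) pairs sorted from most to least similar.

    The score is SequenceMatcher's similarity ratio in [0.0, 1.0]. It is a
    string-similarity measure, not a calibrated probability.
    """
    scored = []
    for candidate in vocabulary:
        score = SequenceMatcher(None, word, candidate).ratio()
        if score >= cutoff:
            scored.append((candidate, score))
    # Highest score first; ties are broken alphabetically so output is stable.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored
```

A real implementation would probably also weight candidates by word frequency, which this sketch ignores.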

add case correction tests

Spell doesn't load in Windows

I traced the issue to utils.py -> with closing(t.extractfile(tar_path)) as f:
It seems the tarfile library doesn't behave as intended on Windows. For now I extracted the file manually and changed the code to open it using os.open, but I intend to make a pull request with a more general solution soon.
Thanks for your effort.

download other languages through a proxy

Hello,

I tried using your library on my work machine, which connects through a proxy. When downloading a dictionary for the Russian language, the error "Tunnel connection failed: 407 Proxy Authentication Required" appears. I know how to deal with such errors when downloading dictionaries for Numpy or when cloning libraries from GitHub, but I could not find out how to do this for your library.

I can probably just download the data for the Russian language separately, but it’s still not clear where to put the file to make it work.
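
If the download happens through Python's `urllib.request` (an assumption; the library's download code is not shown here), a process-wide opener with proxy credentials might get past the 407 error. The proxy URL in the usage note is a placeholder, and `install_proxy` is a hypothetical helper name.

```python
import urllib.request


def install_proxy(proxy_url):
    """Route later urllib.request.urlopen calls through the given proxy.

    proxy_url may embed credentials, e.g. "http://user:password@host:port".
    This only affects code that uses urllib.request's default opener.
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    urllib.request.install_opener(opener)
    return opener
```

Calling `install_proxy("http://user:password@proxy.example.com:8080")` before creating the speller would be the intended usage. Setting the `https_proxy` environment variable is an alternative that `urllib` also honors.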

Jumbled words 'helloworld' could be split into 2 words

No idea how this could possibly work, but it's definitely necessary for preprocessing text data.
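
One common way to approach this is dynamic programming over a known vocabulary. The sketch below assumes a set of valid words is available and prefers the split with the fewest words; `split_words` and its parameters are hypothetical names, not part of the library.

```python
def split_words(text, vocabulary, max_word_len=20):
    """Split text into vocabulary words, or return None if no split exists.

    Among all valid splits, the one with the fewest words is returned.
    """
    n = len(text)
    # best[i] holds the shortest segmentation of text[:i], or None if none exists.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]
```

For example, with the vocabulary `{"hello", "world", "hell", "o"}`, both "hell o world" and "hello world" are valid splits of 'helloworld', and the sketch picks the two-word one. A frequency-weighted score would likely give better results on real text.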

nlp_parser not improving accuracy as it should

utils -> words = set(...) levels the playing field

How to extend the library

Hi,
I want to add some other words and suggestions to this library. Please tell me how I can do this.
Also, where are the reference files in the source code, such as 'en_US_GB_CA_lower.txt' or 'en_US_GB_CA_mixed.txt'?

refactor unit tests to use nose

Freeze (memory/CPU chewed up) when trying to spell long string

spell('ç§�ã�Ÿã�¡ã�¯ãƒ‰ãƒ¼ãƒ“ãƒ«ã�«2009å¹´7æœˆã�«4æ³Šã�—ã�¾ã�—ã�Ÿã€‚ åœ°ä¸‹é‰„ã�‹ã‚‰2åˆ†ã€�ãƒ–ãƒ\xadãƒ¼ãƒ‰ã‚¦ã‚§ã‚¤ã�¾ã�§æ\xad©ã�„ã�¦ã‚‚ã€�ã��ã‚“ã�ªã�«æ°—ã�«ã�ªã‚Šã�¾ã�›ã‚“ã�§ã�—ã�Ÿã�‹ã‚‰ã€�ç«‹åœ°ã�§ã‚‚ã�™ã�¦ã��ã�ªãƒ›ãƒ†ãƒ«ã�§ã�™ã€‚çµŒå–¶è€…ã�‹ã�¨æ€�ã‚�ã‚Œã‚‹è€�å¤«å©¦ã�¨å¨˜ã�•ã‚“ã€�ã�‚ã�¨2äººã�®ã‚¹ã‚¿ãƒƒãƒ•ã�«å‡ºä¼šã�„ã�¾ã�—ã�Ÿã€‚ ã‚´ã‚¹ãƒšãƒ«ã�®æ‰€åœ¨åœ°ã‚’å°‹ã�\xadã�Ÿã‚‰ã€�ãƒ�ãƒƒãƒˆã�§åœ°å›³ã‚’ãƒ—ãƒªãƒ³ãƒˆã�—ã�¦ã��ã‚Œã�¦ã€�è¦ªåˆ‡ã�«èª¬æ˜Žã�—ã�¦ã��ã‚Œã�¾ã�—ã�Ÿã€‚ æ\xad´å�²ã‚’ä¿�ã�¨ã�†ã�¨ã�¨ã�—ã�¦ã�„ã‚‹ãƒ‹ãƒ¥ãƒ¼ãƒ¨ãƒ¼ã‚¯ã‚’æ\xad©ã��æ‹\xa0ç‚¹ã�¨ã�—ã�¦ã€�æœ€é�©ã�ªãƒ›ãƒ†ãƒ«ã�§ã�™ã€‚ ã‚¹ã‚¿ãƒ¼ãƒ�ãƒƒã‚¯ã‚¹ã€�ãƒžã‚¯ãƒ‰ãƒŠãƒ«ãƒ‰ã€�ã‚\xadãƒ³ã‚°ãƒ�ãƒ¼ã‚¬ã‚‚è¿‘ã��ã�«ã�‚ã‚Šã€�ã‚¨ãƒ³ãƒ‘ã‚¤ã‚¢ã‚¹ãƒ†ãƒ¼ãƒˆãƒ“ãƒ«è¿„10åˆ†å¼±ã�§ã�™ã�Œã€�ã‚³ãƒªã‚¢ã�®çµŒå–¶ã�™ã‚‹ã‚³ãƒ³ãƒ“ãƒ‹å…¼é£²é£Ÿåº—ã‚‚æ•°è»’æœ‰ã‚Šã€�ãƒ›ãƒ†ãƒ«ã�®è£�æ‰‹ã�®é€šã‚Šã�«ã�¯æ¶ˆè²»ç¨Žç„¡æ–™ã�®ã‚³ãƒ³ãƒ“ãƒ‹ã‚‚ã�‚ã�£ã�¦ä¾¿åˆ©ã�§ã�—ã�Ÿã€‚ 100å¹´ã�®æ\xad´å�²ã�¨ã�„ã�£ã�¦ã‚‚æ”¹è£…ã�•ã‚Œã�¦ã�„ã�¦æ¸…æ½”ã�ªãƒ›ãƒ†ãƒ«ã�§ã�™ã€‚æ‰‹å‹•ã�®ã‚¨ãƒ¬ãƒ™ãƒ¼ã‚¿ã‚‚å�°è±¡ã�«æ®‹ã‚Šã�¾ã�—ã�Ÿã€‚ å†·è”µåº«ã�Œã�ªã�„ç‚¹ã�¨ã€�ã‚¦ã‚¤ãƒ³ãƒ‰ã‚¦ã‚¿ã‚¤ãƒ—ã�®ã‚¨ã‚¢ã‚³ãƒ³ã�Œã�¡ã‚‡ã�£ã�¨ä¸�ä¾¿ã�ªç‚¹ä»¥å¤–ã�Šå‹§ã‚�ã�§ã�™ã€‚')

This causes memory to chew up to 6GB+ in a matter of seconds.

Took me all day to figure this out!

It would be good to include some sort of blacklist of weird characters to prevent the mysterious memory hog, e.g. throw an error (and provide some API to check whether a string is spellable).
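
A minimal sketch of such a check, assuming the blow-up is triggered by very long tokens or tokens full of symbols. `is_spellable` and the 30-character limit are hypothetical choices, not part of the library.

```python
import string

# Characters allowed inside a word in addition to letters and digits.
ALLOWED_EXTRA = set("'-")


def is_spellable(text, max_token_len=30):
    """Return True if every token looks like an ordinary word or number."""
    tokens = text.split()
    if not tokens:
        return False
    for token in tokens:
        core = token.strip(string.punctuation)
        if not core:
            continue  # token was pure punctuation, nothing to correct
        if len(core) > max_token_len:
            return False  # overly long tokens are rejected up front
        if not all(ch.isalnum() or ch in ALLOWED_EXTRA for ch in core):
            return False  # symbols such as the replacement character U+FFFD
    return True
```

Calling this before spell() and skipping or rejecting input that fails the check would avoid feeding strings like the one above into the corrector.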

Speller() cannot be created since the tarf.extractfile

I installed the latest version, i.e. 2.3.0, on Ubuntu. When I run the code:
spell = Speller()
It reports the error as follows:

self.nlp_data = load_from_tar(lang) if nlp_data is None else nlp_data
File "/home/dgl/virtual_env/textR/lib/python3.5/site-packages/autocorrect/__init__.py", line 78, in load_from_tar
return json.load(file)
File "/usr/lib/python3.5/json/__init__.py", line 268, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "/usr/lib/python3.5/json/__init__.py", line 312, in loads
s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'
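
On Python 3.5, `json.load` and `json.loads` accept only `str`, while `tarfile`'s `extractfile` returns a binary file object; accepting `bytes` input was added in Python 3.6. A sketch of a loader that decodes explicitly and so works on both is shown below. The function name and the `member_name` parameter are hypothetical, and the archive layout used by the library is not shown in this report.

```python
import json
import tarfile


def load_json_from_tar(archive_path, member_name):
    """Read a JSON member from a tar archive, decoding bytes to text first."""
    with tarfile.open(archive_path) as archive:
        raw = archive.extractfile(member_name)
        if raw is None:
            # extractfile returns None for members that are not regular files
            raise FileNotFoundError(member_name)
        with raw:
            # Decode explicitly so json.loads always receives a str.
            return json.loads(raw.read().decode("utf-8"))
```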

Autocorrect for other languages

Hello,

I've used this Python library in my project, but it seems that it only works with the English dictionary. Or does it work for other languages?
I would like to use it to autocorrect words in Portuguese.

Thanks,
Rita

Too slow to be useful

It takes 0.25 seconds (!) to correct a single word (“paiin”). If this stems from the code/algorithm design, I suggest describing the package as a proof-of-concept or toy, not a spelling corrector.

Time out for function

Given a list of words, I'm looping through them to correct any misspelled words. Right now I've been stuck on the 5738th word for more than five minutes, with memory usage up to 12GB of RAM and disk usage of 120MB/s.
It would be nice to have a timeout parameter to abort if the search is taking too long. It is probably possible to optimize the memory usage as well.

Too slow..

It takes 6+ seconds to give me the corrected word.

print(spell('wednsday'));
wednesday
[Finished in 6.5s]

print(spell('hello'));
hello
[Finished in 6.6s]

I would love to contribute to this project. Please tell me your preferred method of contact. Till then, I will go through the code.
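
The "[Finished in 6.5s]" figures above are whole-script run times, so they may include import and dictionary loading as well as the correction itself. A small timing helper like the sketch below could separate per-call cost from startup cost. `time_each_call` is a hypothetical helper, and `func` stands in for whatever spell function is being measured.

```python
import time


def time_each_call(func, inputs):
    """Call func on each input and return (input, result, seconds) tuples."""
    timings = []
    for value in inputs:
        start = time.perf_counter()
        result = func(value)
        elapsed = time.perf_counter() - start
        timings.append((value, result, elapsed))
    return timings
```

If the first call is slow and later calls are fast, the cost is in one-off setup and not in the correction algorithm.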

spell function returns different output for the same word

Every time I restart the kernel in a Jupyter notebook and use the spell function, it returns a different output for the same misspelled word.
Below are some of the examples: