casics / nostril

Nostril: Nonsense String Evaluator

License: GNU Lesser General Public License v2.1

Python 88.64% CSS 8.92% Shell 2.44%
identifiers detector nonsense gibberish source-code mining-software-repositories identifier-string nonsense-string-evaluator inference text-processing

nostril's People

Contributors

mhucka

nostril's Issues

Upload to PyPI

Make Nostril available from PyPI and possibly other package hosting sites.

Program crashes on some inputs

I am getting the following error

Traceback (most recent call last):
  File "/usr/bin/nostril", line 25, in <module>
    plac.call(main)
  File "/usr/lib/python3.7/site-packages/plac_core.py", line 330, in call
    cmd, result = parser.consume(arglist)
  File "/usr/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/lib/python3.7/site-packages/nostril/__main__.py", line 83, in main
    analyze(strings, trace)
  File "/usr/lib/python3.7/site-packages/nostril/__main__.py", line 95, in analyze
    'nonsense' if nonsense(s) else 'real'))
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 906-917: surrogates not allowed

when trying to detect gibberish from user-generated content.
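
The traceback ends in a UnicodeEncodeError about lone surrogates, which UTF-8 cannot encode. One possible workaround, sketched here and not tested against Nostril itself, is to sanitize strings before handing them to the tool. The helper name strip_surrogates is hypothetical and not part of Nostril.

```python
def strip_surrogates(text: str) -> str:
    """Drop lone surrogate code points, which UTF-8 cannot encode.

    Hypothetical helper, not part of Nostril. The 'ignore' error handler
    makes the encoder skip characters it cannot encode.
    """
    return text.encode("utf-8", errors="ignore").decode("utf-8")


# A string containing one lone surrogate (U+DC80) between ordinary letters.
cleaned = strip_surrogates("abc\udc80def")
print(cleaned)  # prints "abcdef"
```

Passing errors="replace" instead would keep a "?" placeholder where each surrogate was, which may be preferable if string length matters.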

Would you consider a license suitable for use as a library in non GPL apps?

Hello!
I maintain https://github.com/nexB/scancode-toolkit and I have been looking for such a library which looks really nice (my use case is to remove gibberish from the strings collected from binaries). ScanCode is Apache-licensed. Would you consider some other licensing for nostril, such as any of a classpath-like exception to the GPL, the LGPL or any other permissive license?
Thank you for your kind consideration!

NB: incidentally, scancode may be of some use in the CASICS project.

Strings less than 6 characters

Is it not possible to work on strings shorter than 6 characters at all? I mean, is there no way to work around the algorithm to test short strings?
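
A caller-side guard is one way to cope with the length limit in the meantime. The sketch below assumes the detector is any callable that takes a string and returns True for nonsense (for example the nonsense function imported from nostril), and that 6 is the minimum length, as the issue title suggests. The names check_or_skip and MIN_LENGTH are hypothetical.

```python
from typing import Callable, Optional

# Assumption: minimum length taken from the issue title, not from Nostril's code.
MIN_LENGTH = 6


def check_or_skip(text: str,
                  detector: Callable[[str], bool],
                  min_length: int = MIN_LENGTH) -> Optional[bool]:
    """Return the detector's verdict, or None when the string is too short.

    Returning None lets the caller tell "not evaluated" apart from
    "evaluated and judged real" (False).
    """
    if len(text) < min_length:
        return None
    return detector(text)
```

This does not make short strings testable; it only avoids calling the detector on input it may not accept.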

Program lacks a way to read input from stdin

To be honest, providing input from a file didn't work either, but in my opinion stdin support is even more important in a Linux environment.

I would expect this command pipeline to work:
printf "%s\n%s\n" "test" "kldfjgnkdgfjn" | nostril
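
Until the command-line tool reads stdin, a small wrapper script could do the same job. This is a minimal sketch, assuming the detector is a callable such as the nonsense function imported from nostril. classify_lines is a hypothetical name, and the 'nonsense' and 'real' labels mirror the ones visible in the traceback of the crash report.

```python
import sys
from typing import Callable, Iterable, Iterator, Tuple


def classify_lines(lines: Iterable[str],
                   detector: Callable[[str], bool]) -> Iterator[Tuple[str, str]]:
    """Yield (text, label) for every non-empty line of the input.

    Hypothetical helper: 'detector' returns True when the text is nonsense.
    """
    for line in lines:
        text = line.strip()
        if text:
            yield text, ("nonsense" if detector(text) else "real")


# Intended use, assuming Nostril is installed (kept as comments so the
# sketch itself runs without it):
#
#     from nostril import nonsense
#     for text, label in classify_lines(sys.stdin, nonsense):
#         print(f"{text}: {label}")
```

Because sys.stdin is an iterable of lines, it can be passed directly as the lines argument.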

Update installation instructions

Since the time this was written, I've come up with slightly clearer instructions for installing from GitHub. The instructions for Nostril should be updated.

Update Caltech logo

It turns out the use of the Caltech seal used on the bottom of the README is discouraged by Caltech's media relationships guidelines. Needs to be switched to the Caltech "C" icon.

PyInstaller shows file ngram_data.pklz cannot be found

It looks like Nostril cannot work with PyInstaller?

I built a Windows EXE, but then I got this error:

File "site-packages\nostril\nonsense_detector.py", line 945, in
File "site-packages\nostril\nonsense_detector.py", line 621, in generate_nonsense_detector
ValueError: Cannot find pickle file C:\Users\user\AppData\Local\Programs\Python\Python37\projects\dn_util\dist\site-packages\nostril\ngram_data.pklz

After that, I manually copied the file ngram_data.pklz to the folder in which it is being sought,
but then another error occurred (an exception when I ran my program):

[Errno 2] No such file or directory: 'C:\Users\user\AppData\Local\Temp\MEI49123\wordsegment\unigrams.txt'

So I believe that if the program worked seamlessly with PyInstaller, I wouldn't need to touch anything.

Note that everything works perfectly when run from Sublime Text.
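
Both errors point at data files that did not make it into the frozen application. The following diagnostic is a sketch only. It relies on sys._MEIPASS, the attribute PyInstaller sets to the unpacked bundle directory at run time. The relative paths in EXPECTED are read off the error messages above and are assumptions about the bundle layout, and the helper names are hypothetical.

```python
import sys
from pathlib import Path
from typing import Iterable, List


def bundle_dir() -> Path:
    """Directory a PyInstaller app unpacks to, or the current directory
    when the script is not running from a frozen bundle."""
    return Path(getattr(sys, "_MEIPASS", Path.cwd()))


def missing_files(base, relative_paths: Iterable[str]) -> List[str]:
    """Return the relative paths that do not exist as files under base."""
    base = Path(base)
    return [rel for rel in relative_paths if not (base / rel).is_file()]


# Assumed layout, inferred from the two error messages in this report.
EXPECTED = ["nostril/ngram_data.pklz", "wordsegment/unigrams.txt"]

print(missing_files(bundle_dir(), EXPECTED))
```

Any path this reports as missing would have to be added to the build, for example through PyInstaller's data-file options.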

How to train with additional corpora

Hello,
The training directory contains only a brief explanation of training.
The README file there points to a nonexistent ngram file (there's no ngram_frequencies file in the repo).
Could you describe a quick procedure for performing the training to get the final ngram pickle file, and say what the requirements are for the corpus file format?
Thanks,
Fausto
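
For readers unfamiliar with the term, an n-gram frequency table simply counts how often each run of n consecutive characters occurs in a corpus. The sketch below illustrates that general idea only. It is not Nostril's training procedure, and the function name, the lowercasing step, and the default n are all assumptions.

```python
from collections import Counter
from typing import Iterable


def ngram_counts(words: Iterable[str], n: int = 4) -> Counter:
    """Count every run of n consecutive characters across all words.

    Illustrative only: Nostril's real training code may tokenize,
    normalize, and weight n-grams differently.
    """
    counts: Counter = Counter()
    for word in words:
        word = word.lower()
        # A word shorter than n contributes nothing (empty range).
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


print(ngram_counts(["abcd", "bcde"], n=3))
```

In this example "bcd" occurs twice, while "abc" and "cde" occur once each.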

Check words from file

I can’t check phrases from a text file. Perhaps I did not understand how to pass the file name correctly:

nostril -f wordlist.txt
nostril: error: argument -f/--file: invalid NoneType value: 'wordlist.txt'

nostril -f /home/user/nostril/wordlist.txt
nostril: error: argument -f/--file: invalid NoneType value: '/home/nekto/nostril/wordlist.txt

Memory leak

There's a rather large memory leak in this library. I won't have the time to find it, but I noticed it when spawning a project many times as a subprocess: with nostril imported, the machine's RAM is quickly used up. Without importing nostril, the program runs fine with constant memory.

Nonsense import error

This is the error I receive when running the demo. I'm not sure whether it is a Python version issue or not.

from nostril import nonsense
ImportError: cannot import name 'nonsense' from 'nostril'.
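
One common cause of this kind of ImportError, though not confirmed for this report, is that Python picks up a different module named nostril than the installed package, such as a local script called nostril.py. A quick check is to ask the import system where the name resolves. module_origin is a hypothetical helper name.

```python
import importlib.util
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return the file a top-level module name resolves to, or None
    if the import system cannot find it."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# If this prints a path to your own script rather than to the installed
# package directory, the local file is shadowing the real package.
print(module_origin("nostril"))
```

If the path points into site-packages and the import still fails, the cause lies elsewhere, for example in the installed version.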
