Nostril: Nonsense String Evaluator

License: GNU Lesser General Public License v2.1

Nostril

Nostril is the Nonsense String Evaluator: a Python module that infers whether a given short string of characters is likely to be random gibberish or something meaningful.

Author: Michael Hucka (ORCID: 0000-0001-9105-5960)
Code repository: https://github.com/casics/nostril
License: Unless otherwise noted, this content is licensed under the LGPL version 2.1 license.

🏁 Recent news and activities

November 2019: Release version 1.2.0 changes the license for Nostril to LGPL version 2.1. There are no API or behavioral changes; all changes are limited to documentation strings, the README file, and a new DOI.

The file NEWS contains a more complete change log that includes information about previous releases.

☀ Introduction

A number of research efforts have investigated extracting and analyzing textual information contained in software artifacts. However, source code files can contain meaningless text, such as random text used as markers or test cases, and code extraction methods can also sometimes make mistakes and produce garbled text. When used in processing pipelines without human intervention, it is often important to include a data cleaning step before passing tokens extracted from source code to subsequent analysis or machine learning algorithms. Thus, a basic (and often unmentioned) step is to filter out nonsense tokens.

Nostril is a Python 3 module that can be used to infer whether a given word or text string is likely to be nonsense or meaningful text. Nostril takes a text string and returns True if it is probably nonsense, False otherwise. Meaningful in this case means a string of characters that is probably constructed from real or real-looking English words or fragments of real words (even if the words are run togetherlikethis). The main use case is to decide whether short strings returned by source code mining methods are likely to be program identifiers (of classes, functions, variables, etc.), or random characters or other non-identifier strings. To illustrate, the following example code,

from nostril import nonsense
real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))

produces the following output:

bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense

Nostril uses a combination of heuristic rules and a probabilistic assessment. It is not always correct (see below). It is tuned to reduce false positives: it is more likely to say something is not gibberish when it really might be. This is suitable for its intended purpose of filtering source code identifiers – a difficult problem, incidentally, because program identifiers often consist of acronyms and word fragments jammed together (e.g., "kBoPoMoFoOrderIdCID", "ioFlXFndrInfo", etc.), which can challenge even humans. Nevertheless, on the identifier strings from the Loyola University of Delaware Identifier Splitting Oracle, Nostril classifies over 99% correctly.

Nostril is reasonably fast: once the module is loaded, on a 4 GHz Apple macOS 10.12 computer, calling the evaluation function returns a result in 30–50 microseconds per string on average.

♥️ Please cite the Nostril paper and the version you use

Article citations are critical for academic developers. If you use Nostril and you publish papers about work that uses Nostril, please cite the Nostril paper:

Hucka, M. (2018). Nostril: A nonsense string evaluator written in Python. Journal of Open Source Software, 3(25), 596, https://doi.org/10.21105/joss.00596

Please also use the DOI to indicate the specific version you use, to improve other people's ability to reproduce your results.

✺ Installation instructions

The following is probably the simplest and most direct way to install Nostril on your computer:

sudo pip3 install git+https://github.com/casics/nostril.git

Alternatively, you can clone this repository and then install it with pip, which runs setup.py for you:

git clone https://github.com/casics/nostril.git
cd nostril
sudo python3 -m pip install .

Both of these installation approaches should automatically install some Python dependencies that Nostril relies upon, namely plac, tabulate, humanize, and pytest.

► Using Nostril

The basic usage is very simple. Nostril provides a Python function named nonsense(). This function takes a single text string as an argument and returns a Boolean value as a result. Here is an example:

from nostril import nonsense
if nonsense('yoursinglestringhere'):
    print("nonsense")
else:
    print("real")

The Nostril source code distribution also comes with a command-line program called nostril. You can invoke the nostril command-line interface in two ways:

  1. Using the Python interpreter:
    python3 -m nostril
    
  2. On Linux and macOS systems, using the program nostril, which should be installed automatically by setup.py in a bin directory on your shell's command search path. Thus, you should be able to run it normally:
    nostril
    

The command-line program can take strings on the command line or (with the -f option) in a file, and will return nonsense-or-not assessments for each string. It can be useful for interactive testing and experimentation. For example:

# nostril bunchofwords xywinlist ioFlXFndrInfo lasaakldfalakj xyxyxyx
bunchofwords    [real]
xywinlist       [real]
ioFlXFndrInfo   [real]
lasaakldfalakj  [nonsense]
xyxyxyx         [nonsense]

Beware that the Nostril module takes a noticeable amount of time to load, and since the command-line program must reload the module anew each time, it is relatively slow as a means of using Nostril. (In normal usage, your program would only load the Python module once and not incur the loading time on every call.)
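For batch work, the load cost is paid only once if all strings are classified inside a single Python process. The following is a minimal sketch rather than part of Nostril: classify_lines is a hypothetical helper that accepts any iterable of lines (an open file, sys.stdin) together with a classifier callable such as nostril.nonsense. A toy stand-in classifier is used here so that the sketch runs without Nostril installed.

```python
import io


def classify_lines(lines, classifier):
    """Yield (string, label) pairs for every non-empty line.

    `classifier` is any callable that returns True for nonsense,
    for example nostril.nonsense imported once by the caller.
    """
    for line in lines:
        text = line.strip()
        if text:
            yield text, 'nonsense' if classifier(text) else 'real'


# Stand-in classifier for demonstration only; it is NOT Nostril's logic.
# In real use, replace it with: from nostril import nonsense
def toy_classifier(text):
    return 'q' in text


sample = io.StringIO('bunchofwords\n\nzxcvbnmlkjhgfdsaqwerty\n')
for text, label in classify_lines(sample, toy_classifier):
    print('{}: {}'.format(text, label))
```

Passing the classifier as an argument keeps the loop independent of how the module was imported, so the import happens exactly once in the calling program.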

Nostril ignores numbers, spaces and punctuation characters embedded in the input string. This was a design decision made for practicality – it makes Nostril a bit easier to use. If, for your application, non-letter characters indicate a string that is definitely nonsense, then you may wish to test for that separately before passing the string to Nostril.
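One way to perform such a separate test is sketched below. Both has_non_letters and is_nonsense_strict are hypothetical helper names, and treating only ASCII letters as "letters" is an assumption of this sketch, not a statement about how Nostril handles other alphabets.

```python
import string

# Assumption for this sketch: only ASCII letters count as letters.
_LETTERS = frozenset(string.ascii_letters)


def has_non_letters(text):
    """Return True if `text` contains any character that is not an ASCII letter."""
    return any(ch not in _LETTERS for ch in text)


def is_nonsense_strict(text, classifier):
    """Label strings with non-letter characters as nonsense outright;
    otherwise defer to `classifier` (for example nostril.nonsense)."""
    if has_non_letters(text):
        return True
    return classifier(text)
```

With this wrapper, a string such as 'foo_bar1' is rejected before the classifier is ever called, while purely alphabetic strings are evaluated as usual.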

Please see the docs subdirectory for more information about Nostril and its operation.

🎯 Performance

You can verify the following results yourself by running the small test program tests/test.py. The following are the results on sets of strings that are all either real identifiers or all random/gibberish text:

                               Type of content         Results
Test case                      Meaningful  Gibberish   False pos.  False neg.  Accuracy
/usr/share/dict/web2              218,752          0           89           0    99.96%
Ludiso oracle                       2,540          0            6           0    99.76%
Auto-generated random strings           0    997,636            0      82,754    91.70%
Hand-written random strings             0      1,000            0         205    79.50%
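The accuracy figures follow directly from the counts in each row: the fraction of strings that were neither false positives nor false negatives. The short check below recomputes them; the function name accuracy is just a label chosen for this sketch.

```python
def accuracy(total, false_pos, false_neg):
    """Percentage of strings classified correctly."""
    return 100.0 * (total - false_pos - false_neg) / total


# (test case, number of strings, false positives, false negatives)
rows = [
    ('/usr/share/dict/web2', 218752, 89, 0),
    ('Ludiso oracle', 2540, 6, 0),
    ('Auto-generated random strings', 997636, 0, 82754),
    ('Hand-written random strings', 1000, 0, 205),
]

for name, total, false_pos, false_neg in rows:
    print('{:<30} {:.2f}%'.format(name, accuracy(total, false_pos, false_neg)))
```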

In tests on real identifiers extracted from actual software source code mined by the author in another project, Nostril's performance is as follows:

                              Type of content         Results
Test case                     Meaningful  Gibberish   False pos.  False neg.  Precision  Recall
Strings mined from real code       4,261        364            6           5     98.36%  98.63%
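The table does not spell out how precision and recall were derived, but the reported figures are consistent with treating "nonsense" as the positive class, so that true positives are the gibberish strings that were not missed. The sketch below works through that arithmetic under this assumption.

```python
def precision_recall(gibberish, false_pos, false_neg):
    """Precision and recall with 'nonsense' as the positive class."""
    true_pos = gibberish - false_neg          # gibberish correctly flagged
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / gibberish
    return precision, recall


precision, recall = precision_recall(gibberish=364, false_pos=6, false_neg=5)
print('precision {:.2%}, recall {:.2%}'.format(precision, recall))
```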

⚠️ Limitations

Nostril is not foolproof; it will generate some false positives and false negatives. This is an unavoidable consequence of the problem domain: without direct knowledge, even a human cannot recognize a real text string in all cases. Nostril's default trained system puts emphasis on reducing false positives (i.e., reducing how often it mistakenly labels something as nonsense) rather than false negatives, so it will sometimes report that something is not nonsense when it really is.

A vexing result is that this system performs worse on supposedly "random" strings typed by a human. I hypothesize this is because those strings may be less random than they seem: if someone is asked to type junk at random on a QWERTY keyboard, they are likely to use a lot of characters from the home row (a-s-d-f-g-h-j-k-l), and those actually turn out to be rather common in English words. In other words, what we think of as strings "typed at random" on a keyboard are actually not that random, and probably have statistical properties similar to those of real words. These cases are hard for Nostril, but thankfully, in real-world situations, they are rare. This view is supported by the fact that Nostril's performance is much better on statistically random text strings generated by software.

Nostril has been trained using American English words, and is unlikely to work for other languages unchanged. However, the underlying framework might work if it were retrained on different sample inputs. Nostril uses n-grams coupled with a custom TF-IDF weighting scheme. See the subdirectory training for the code used to train the system.
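To give a rough sense of what character n-grams with TF-IDF weighting look like, here is a minimal sketch using the textbook formulation. It is not Nostril's custom weighting scheme or its training code: the function names, the n-gram length of 4, and the plain log(N/df) weight are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of `text`, lowercased."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def idf_table(corpus, n=4):
    """Map each n-gram to log(N / df), where df is the number of
    corpus strings that contain the n-gram at least once."""
    doc_freq = Counter()
    for word in corpus:
        doc_freq.update(set(char_ngrams(word, n)))
    total = len(corpus)
    return {gram: math.log(total / df) for gram, df in doc_freq.items()}


def tfidf_vector(text, idf, n=4):
    """Return {ngram: count * idf} for the n-grams of `text` seen in training."""
    counts = Counter(char_ngrams(text, n))
    return {gram: c * idf[gram] for gram, c in counts.items() if gram in idf}


idf = idf_table(['getint', 'setint'])
print(tfidf_vector('getint', idf))
```

In this toy corpus, n-grams shared by every training string receive a weight of zero, while rarer ones are weighted more heavily; retraining on another language would amount to rebuilding such a table from different sample inputs.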

Finally, the algorithm does not perform well on very short text, and by default Nostril imposes a lower length limit of 6 characters – strings must be at least 6 characters long (as with 'getint' in the example above) or else it will raise an exception.
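When a pipeline may contain short tokens, one option is to screen them out before calling Nostril instead of handling the exception afterwards. The sketch below is a hypothetical wrapper, not part of Nostril: classify_or_none and its min_letters parameter are names invented here, and counting only letters toward the limit is an assumption based on the statement that Nostril ignores digits, spaces and punctuation. Note that 'getint' in the introductory example has exactly 6 letters and is classified.

```python
def classify_or_none(text, classifier, min_letters=6):
    """Return the classifier's True/False verdict, or None when `text`
    has too few letters to be evaluated.

    `classifier` is any callable such as nostril.nonsense.
    """
    # Assumption: only letters count toward the length limit.
    letters = sum(1 for ch in text if ch.isalpha())
    if letters < min_letters:
        return None
    return classifier(text)
```

Returning None keeps "too short to judge" distinct from both "real" and "nonsense", so the caller can decide how to treat such tokens.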

📚 More information

Please see the docs subdirectory for more information.

⁇ Getting help and support

If you find an issue, please submit it in the GitHub issue tracker for this repository.

♬ Contributing — info for developers

Any constructive contributions – bug reports, pull requests (code or documentation), suggestions for improvements, and more – are welcome. Please feel free to contact me directly, or even better, jump right in and use the standard GitHub approach of forking the repo and creating a pull request.

Everyone is asked to read and respect the code of conduct when participating in this project.

❤️ Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant Number 1533792 (Principal Investigator: Michael Hucka). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


             

nostril's Issues

PyInstaller shows file ngram_data.pklz cannot be found

It looks like Nostril cannot work with PyInstaller?

I got a Windows EXE, but then I got this error:

File "site-packages\nostril\nonsense_detector.py", line 945, in
File "site-packages\nostril\nonsense_detector.py", line 621, in generate_nonsense_detector
ValueError: Cannot find pickle file C:\Users\user\AppData\Local\Programs\Python\Python37\projects\dn_util\dist\site-packages\nostril\ngram_data.pklz

After that, I copied the file ngram_data.pklz manually to the folder in which it is being sought,
but then another error occurred (an exception when I ran my program):

[Errno 2] No such file or directory: 'C:\Users\user\AppData\Local\Temp\MEI49123\wordsegment\unigrams.txt'

So I believe, if the program was to work seamlessly with PyInstaller, I probably wouldn't need to touch anything.

Notice that everything works perfectly in the Sublime Text IDE.

Nonsense import error

This is the error I receive upon running the demo. I am not sure whether this is a Python version error or not.

from nostril import nonsense
ImportError: cannot import name 'nonsense' from 'nostril'.

Memory leak

There's a rather large memory leak in this library. I won't have the time to find it, but I noticed it when forking (subprocess) a project many times: importing nostril causes the machine's RAM to be used up quickly. Without importing nostril, the program runs fine with constant memory.

Update Caltech logo

It turns out the use of the Caltech seal used on the bottom of the README is discouraged by Caltech's media relationships guidelines. Needs to be switched to the Caltech "C" icon.

Update installation instructions

Since the time this was written, I've come up with slightly clearer instructions for installing from GitHub. The instructions for Nostril should be updated.

Check words from file

I can’t check phrases from a text file. Perhaps I did not understand how to pass the file name correctly.

nostril -f wordlist.txt
nostril: error: argument -f/--file: invalid NoneType value: 'wordlist.txt'

nostril -f /home/user/nostril/wordlist.txt
nostril: error: argument -f/--file: invalid NoneType value: '/home/nekto/nostril/wordlist.txt

Archiving the project

Due to lack of time and changing jobs, I have to suspend further activity on this project. I am archiving it; if for some reason it becomes useful in the future, the project will be unarchived.

I thank the contributors, and am deeply sorry I was not able to incorporate the PRs and requested changes.

Would you consider a license suitable for use as a library in non GPL apps?

Hello!
I maintain https://github.com/nexB/scancode-toolkit and I have been looking for such a library which looks really nice (my use case is to remove gibberish from the strings collected from binaries). ScanCode is Apache-licensed. Would you consider some other licensing for nostril, such as any of a classpath-like exception to the GPL, the LGPL or any other permissive license?
Thank you for your kind consideration!

NB: incidentally, scancode may be of some use in the CASICS project.

Strings less than 6 characters

Can it not work on strings shorter than 6 characters at all? I mean, is there no way to get around the algorithm's limit to test short strings?

Upload to PyPI

Make Nostril be available from PyPI and possibly other package hosting sites.

Program lacks a way to read input from stdin

To be honest, providing input from a file didn't work either, but in my opinion stdin support is of even greater importance in a Linux environment.

I would expect this command pipeline to work:
printf "%s\n%s\n" "test" "kldfjgnkdgfjn" | nostril

Program crashes on some inputs

Getting the below error

Traceback (most recent call last):
  File "/usr/bin/nostril", line 25, in <module>
    plac.call(main)
  File "/usr/lib/python3.7/site-packages/plac_core.py", line 330, in call
    cmd, result = parser.consume(arglist)
  File "/usr/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/lib/python3.7/site-packages/nostril/__main__.py", line 83, in main
    analyze(strings, trace)
  File "/usr/lib/python3.7/site-packages/nostril/__main__.py", line 95, in analyze
    'nonsense' if nonsense(s) else 'real'))
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 906-917: surrogates not allowed

when trying to detect gibberish from user-generated content.

How to train with additional corpora

Hello,
In the training directory, there's a brief explanation of training.
The README file points to a nonexistent ngram file (there's no ngram_frequencies file in the repo).
Would it be possible to describe a quick procedure for performing the training to get the final ngram pickle file, and what the requirements are for the corpora file format?
Thanks,
Fausto
