netherlandsforensicinstitute / demeuk

Demeuk is a simple tool to clean up corpora (like dictionaries) or any dataset containing plain text strings.

License: Apache License 2.0

Python 98.73% Dockerfile 1.27%
cleanup corpora corpus encoding hacktoberfest passwords

demeuk's People

Contributors

akaidiot, bartbroere, gingergeneste, rixvet, wineh, zyronix

demeuk's Issues

Splitting + unhexing goes wrong

When a line contains a hexed string and you use the split function, it fails.

Example:
$cacheword$*1234567890abcdef1234567890abcdef*12345*1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef*1234567890abcdef12345678*1234567890abcdef1234567890abcdef:$HEX[4578616d706c653a24313233]
is added to the queue again as:

4578616d706c653a24313233
Which becomes: Example:$123
Split will be run on this line AGAIN, which results in $123 being added.

Solution:
UNHEX should be run AFTER splitting and not before splitting.
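
The proposed order can be sketched as follows. This is a minimal illustration and not demeuk's actual code: the function names, the regular expression, and the choice to split on the first ':' are assumptions made for the example.

```python
import re

# Matches a complete $HEX[...] token with an even number of hex digits.
HEX_PATTERN = re.compile(r'^\$HEX\[((?:[0-9a-fA-F]{2})*)\]$')


def unhex(value):
    """Decode a $HEX[...] token; return other values unchanged."""
    match = HEX_PATTERN.match(value)
    if not match:
        return value
    return bytes.fromhex(match.group(1)).decode('utf-8', errors='replace')


def split_then_unhex(line, delimiter=':'):
    """Split first (hypothetical: keep the part after the first delimiter),
    then unhex the result. The decoded text is never split again."""
    _, _, candidate = line.partition(delimiter)
    return unhex(candidate)


if __name__ == '__main__':
    print(split_then_unhex('somehash:$HEX[4578616d706c653a24313233]'))
```

Because the unhexed value is returned directly instead of being queued again, the ':' inside the decoded string is not treated as a delimiter.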

Feature: Length dropped based on entropy

Currently the length dropping in demeuk is done on Unicode string length. People tend to make passwords with repeating sequences.

This item is a request to add a feature to demeuk that checks for those repeating patterns and still adds items that do not meet the length requirement but consist of a repeating pattern.

Example:
passwordpasswordpasswordpassword
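
One way to detect such a repetition is sketched below. It is only an illustration of the idea, not an entropy measure and not part of demeuk; the function name `repeating_unit` is hypothetical. It relies on the observation that a string built from a repeated unit reappears inside its own doubling before the halfway point.

```python
def repeating_unit(text):
    """Return the smallest unit that, repeated, produces text.

    Returns None when text is empty or is not a repetition.
    """
    if not text:
        return None
    # text always occurs in text + text at index len(text); an earlier
    # occurrence (from index 1 on) means text is a repeated unit.
    index = (text + text).find(text, 1)
    if index < len(text):
        return text[:index]
    return None


if __name__ == '__main__':
    print(repeating_unit('passwordpasswordpasswordpassword'))
    print(repeating_unit('password'))
```

A length check could then be applied to the returned unit instead of the full line, if that matches the intended behaviour.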

Refactoring commandline parsing

Currently the processing of commandline arguments is a bit messy. Switching to something like click, or to a newer version of argparse, would make it easier to define default options clearly and to provide an error message stating which argument is missing.
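
A minimal sketch of what this could look like with argparse from the standard library. The short flags -i, -o, -j and --check-max-length appear elsewhere in these issues; the long names --input, --output and --threads, the defaults, and the help texts are assumptions made for the example.

```python
import argparse


def build_parser():
    """Build a parser with explicit defaults and required arguments."""
    parser = argparse.ArgumentParser(prog='demeuk.py')
    # required=True makes argparse report which argument is missing.
    parser.add_argument('-i', '--input', required=True,
                        help='input file (long name is an assumption)')
    parser.add_argument('-o', '--output', required=True,
                        help='output file (long name is an assumption)')
    parser.add_argument('-j', '--threads', default='all',
                        help='number of cores to use (default: %(default)s)')
    parser.add_argument('--check-max-length', type=int, default=None,
                        help='drop lines longer than this many characters')
    return parser


if __name__ == '__main__':
    args = build_parser().parse_args(['-i', 'in.txt', '-o', 'out.txt'])
    print(args)
```

When -i or -o is left out, argparse exits with a usage message that names the missing argument.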

Autodetection

Demeuk keeps gaining options. For newcomers I want a simple feature that scans a document and prints out the cleaning options that apply to it.

The scan could jump to 0%, 10%, 20%, etc. of the file and scan 1% at each position. This results in scanning 10% of the file, which should be sufficient.

Goal: scanning a very large text file (let's say 10GB) should take at most 1 minute. For each module that hits, a small snippet should be given showing which lines were hit.

So for example:
`
Check_hashes; found hashes; examples;
$h$7/uhfibmxg83yq6y1rh5y9wjee13kh.
$6$/fasjdfsadj$safjasdfasjdfasdjf/asdfsadfasdfasdfas/fadsfasdfa

Check_encoding; encoding decoding using 'utf-8'; examples

Check_controlchar; found control chars; example
\x07

Scanning done, run demeuk with:
demeuk.py -i -o --check-controlchar --check-hash --check-encoding
`
Also, change some default behavior:

  • Disable all options by default
  • Always use all cores (-j all)

Modify/Remove modules, e.g. 'html', 'html-named' and 'remove-email', should have an Add module equivalent

Not all modules have an 'Add' equivalent:

For example:

    --hex                           Replace lines like: $HEX[41424344] with ABCD.
    --html                          Replace lines like: &#351;ifreyok with şifreyok.
    --html-named                    Replace lines like: &#alpha; Those structures are more like passwords, so
                                    be careful to enable this option.
    --non-ascii                     Replace non ascii char with their replacement letters. For example ü
                                    becomes u, ç becomes c.

Remove modules (remove specific parts of a line):
    --remove-strip-punctuation      Remove starting and trailing punctuation
    --remove-punctuation            Remove all punctuation in a line
    --remove-email                  Enable email filter, this will catch strings like
                                    1238661:[email protected]:password

When creating dictionaries it's sometimes useful to keep both variants available.

Suggestions:

  • It could make sense to auto-generate the 'Add' equivalent in code.
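
A sketch of how an 'Add' variant could be derived from an existing modify function. The wrapper `make_add_variant` is hypothetical, and `html.unescape` from the standard library only stands in for a modify module here; demeuk's own modules may be structured differently.

```python
import html


def make_add_variant(modify):
    """Wrap a modify function so that the original line is kept and the
    modified line is added next to it when the two differ."""
    def add_variant(line):
        modified = modify(line)
        if modified == line:
            return [line]
        return [line, modified]
    return add_variant


# Example: an 'add' counterpart for an HTML-unescape style module.
add_html = make_add_variant(html.unescape)

if __name__ == '__main__':
    print(add_html('&#351;ifreyok'))
```

Generating the variants this way keeps the modify and add behaviour in sync, since both run the same underlying function.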

Empty lines in output dict with --remove-punctuation

What?
When demeuking a file with --remove-punctuation, lines that consist entirely of punctuation end up as blank lines in the resulting dictionary.

Reproduction?

$ cat <<EOF > foo.dict
first-password
-
-+
-+-
-+-+
-+-+-
-+-+-+
-+-+-+-
-+-+-+-+
-+-+-+-+-
-+-+-+-+-+
-+-+-+-+-+-
-+-+-+-+-+-+
-+-+-+-+-+-+-
example
foobar
EOF

$ python3 demeuk.py -i foo.dict -o bar.dict --remove-punctuation

$ uniq -c bar.dict
      1 firstpassword
     13
      1 example
      1 foobar

Expects?
Empty lines (or duplicated entries) to be removed from results.
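
The expected behaviour could be sketched like this. The function is illustrative only: it uses `string.punctuation` from the standard library, which may not match the punctuation set demeuk uses.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)


def remove_punctuation(lines):
    """Strip punctuation from each line and skip lines that become empty."""
    for line in lines:
        cleaned = line.translate(PUNCTUATION_TABLE)
        if cleaned:
            yield cleaned


if __name__ == '__main__':
    sample = ['first-password', '-', '-+', '-+-', 'example', 'foobar']
    print(list(remove_punctuation(sample)))
```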

Add RFC 2307 hash detection support to --hash

https://datatracker.ietf.org/doc/html/rfc2307

RFC 2307 (Experimental) suggests user passwords be hashed using a one-way (hopefully) cryptographically safe algorithm. They are often referred to as being "encrypted", but this is a misnomer (as they are not designed to be decrypted).

OpenLDAP supports RFC 2307 hashed passwords, including the {CRYPT}, {SSHA}, {SHA}, {SMD5}, {MD5}, and other schemes. Such passwords may be used as userPassword values and/or rootpw value.

Note: use of RFC 2307 Experimental passwords violates the Standard Track specification, RFC 2256, for user passwords and may lead to interoperability problems. 

https://www.openldap.org/faq/data/cache/346.html
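
A detection rule for the schemes listed above might look like the sketch below. The regular expression and function name are assumptions made for the example; they only recognise the scheme prefix and do not validate the encoded hash.

```python
import base64
import hashlib
import re

# Scheme prefixes named in the OpenLDAP FAQ text above.
RFC2307_PATTERN = re.compile(r'^\{(CRYPT|SSHA|SHA|SMD5|MD5)\}\S+$',
                             re.IGNORECASE)


def rfc2307_scheme(line):
    """Return the upper-cased scheme name if line looks like an
    RFC 2307 style hash, else None."""
    match = RFC2307_PATTERN.match(line.strip())
    return match.group(1).upper() if match else None


if __name__ == '__main__':
    digest = hashlib.sha1(b'password').digest()
    example = '{SHA}' + base64.b64encode(digest).decode('ascii')
    print(example, rfc2307_scheme(example))
```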

--check-max-length(x) drops lines with length x

Whether this is a bug depends on whether 'max' is taken to include the given number.

My expectation is that to drop lines longer than 16 characters, I pass --check-max-length 16. However, this drops lines longer than 15, so a 16-character line is dropped as well.
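
The inclusive reading of 'max' corresponds to a `<=` comparison, sketched below. The function name is hypothetical and this is not demeuk's current check.

```python
def within_max_length(line, max_length):
    """Keep lines whose length is at most max_length (inclusive)."""
    return len(line) <= max_length


if __name__ == '__main__':
    lines = ['a' * 15, 'a' * 16, 'a' * 17]
    kept = [line for line in lines if within_max_length(line, 16)]
    print([len(line) for line in kept])
```

With this comparison a 16-character line survives --check-max-length 16 and only longer lines are dropped.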

Progress status

Small idea: print out the status. The app already knows where it is (which chunk out of all chunks), so calculating the status is not difficult.
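
As a rough sketch, the status line could be derived from the chunk counters like this. The function name and output format are assumptions.

```python
def format_progress(chunks_done, chunks_total):
    """Format a status line from the number of processed chunks."""
    if chunks_total <= 0:
        return 'progress: unknown'
    percent = 100 * chunks_done / chunks_total
    return f'progress: {chunks_done}/{chunks_total} chunks ({percent:.1f}%)'


if __name__ == '__main__':
    print(format_progress(3, 12))
```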
