netherlandsforensicinstitute / demeuk

Demeuk is a simple tool to clean up corpora (like dictionaries) or any dataset containing plain text strings.

License: Apache License 2.0

Python 98.73% Dockerfile 1.27%

cleanup corpora corpus encoding hacktoberfest passwords

demeuk's People

Contributors

bartbroere jessevz

demeuk's Issues

Splitting + unhexing goes wrong

When a line contains a hexed string and you use the split function, the result is wrong.

Example:
$cacheword$*1234567890abcdef1234567890abcdef*12345*1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef*1234567890abcdef12345678*1234567890abcdef1234567890abcdef:$HEX[4578616d706c653a24313233]
is added to the queue again as:

4578616d706c653a24313233
Which becomes: Example:$123
Split is then run on this line again, which results in adding $123.

Solution:
Unhexing should run after splitting, not before.
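
The proposed ordering can be sketched as follows. This is a minimal illustration, not demeuk's actual code: the function names, the single split on the first `:` and the choice to keep the right-hand part are all assumptions made for the example.

```python
import re
from binascii import unhexlify

# Matches a whole field of the form $HEX[...] with an even number of hex digits.
HEX_FIELD = re.compile(r"\$HEX\[((?:[0-9a-fA-F]{2})+)\]")


def unhex(field):
    """Decode a $HEX[...] field; return anything else unchanged."""
    match = HEX_FIELD.fullmatch(field)
    if match is None:
        return field
    return unhexlify(match.group(1)).decode("utf-8", errors="replace")


def split_then_unhex(line, delimiter=":"):
    """Split once on the delimiter, then unhex the part that is kept."""
    _, found, rest = line.partition(delimiter)
    kept = rest if found else line
    # The decoded text is returned as-is and is never split a second time.
    return unhex(kept)


# Hypothetical shortened input; the hex payload is the one from the issue.
print(split_then_unhex("somehash:$HEX[4578616d706c653a24313233]"))
```

Because the split happens on the still-encoded line, the `:` inside the decoded value cannot trigger another split.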

Feature: Length dropped based on entropy

Currently, length-based dropping in demeuk uses the Unicode string length. People tend to build passwords out of repeating sequences.

This is a request to add a feature to demeuk that checks for such repeating patterns and still adds lines that fail the length requirement but consist of a repeating pattern.

Example:
passwordpasswordpasswordpassword
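
One possible way to detect such lines is to look for the smallest unit that repeats to form the whole string. The sketch below uses the common "search the string inside its own doubling" trick; the function name is hypothetical and this is only one of several ways the request could be implemented.

```python
def smallest_repeating_unit(text):
    """Return the shortest prefix that, repeated, produces the whole string.

    If the string is not a repetition, the string itself is returned.
    """
    if not text:
        return text
    # The first position > 0 at which text re-occurs in text + text is its period.
    period = (text + text).find(text, 1)
    if period < len(text) and len(text) % period == 0:
        return text[:period]
    return text


print(smallest_repeating_unit("passwordpasswordpasswordpassword"))
```

A length check could then be applied to the repeating unit instead of (or in addition to) the full line.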

Refactoring commandline parsing

Currently the processing of command-line arguments is a bit messy. Switching to something like click, or to a newer version of argparse, would make it easier to declare default options clearly and to provide an error message stating which argument is missing.
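
As a rough sketch of what this could look like with the standard-library argparse: the short flags `-i`, `-o`, `-j` and `--check-hash` appear elsewhere on this page, while the long names `--input`, `--output`, `--threads`, the help texts and the default value are assumptions for illustration only.

```python
import argparse


def build_parser():
    """Build a parser with explicit defaults and required arguments."""
    parser = argparse.ArgumentParser(
        prog="demeuk.py",
        description="Clean up a corpus of plain text strings.",
        # Show default values in --help output automatically.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("-i", "--input", required=True, help="input file")
    parser.add_argument("-o", "--output", required=True, help="output file")
    parser.add_argument("-j", "--threads", default="all", help="number of cores to use")
    parser.add_argument("--check-hash", action="store_true", help="drop lines that look like hashes")
    return parser


# Parsing an explicit argument list keeps the example independent of sys.argv.
args = build_parser().parse_args(["-i", "in.txt", "-o", "out.txt"])
print(args.input, args.output, args.threads, args.check_hash)
```

When a required argument is missing, argparse prints a usage message naming the missing argument and exits with status 2, which covers the error-message part of the request.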

E-mail detection false match when two '@' signs

The check-email catches [email protected]@Home as a valid e-mail, but of course it is not.

Oops
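
A stricter check could refuse any candidate that contains more than one '@'. The pattern below is a simplified sketch and not demeuk's real expression; the function name and the sample addresses are made up for the example.

```python
import re

# The local part may not contain '@' or whitespace, and the domain consists of
# dot-separated labels, so a second '@' anywhere makes the full match fail.
EMAIL = re.compile(r"[^@\s]{1,64}@(?:[A-Za-z0-9-]{1,63}\.)+[A-Za-z]{2,}")


def looks_like_email(value):
    """Return True only if the whole string is a single plausible address."""
    return EMAIL.fullmatch(value) is not None


print(looks_like_email("user@example.com"))
print(looks_like_email("user@example.com@Home"))
```

Using `fullmatch` rather than a partial search is what rejects the trailing `@Home` part.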

Autodetection

Demeuk keeps gaining options. For newcomers I want a simple feature that scans a document and prints out the suggested cleaning options.

To keep this fast, the scan could jump to 0%, 10%, 20%, etc. of the file and scan 1% at each position. This scans 10% of the file in total, which should be sufficient.

Goal: scanning a very large text file (let's say 10 GB) should take at most 1 minute. For each module that hits, a small snippet should show which lines were hit.

So for example:
`
Check_hashes; found hashes; examples;
$h$7/uhfibmxg83yq6y1rh5y9wjee13kh.
$6$/fasjdfsadj$safjasdfasjdfasdjf/asdfsadfasdfasdfas/fadsfasdfa

Check_encoding; encoding decoding using 'utf-8'; examples

Check_controlchar; found control chars; example
\x07

Scanning done, run demeuk with:
demeuk.py -i -o --check-controlchar --check-hash --check-encoding
`
Also, change some default behavior:

Disable all options by default
Always use all cores (-j all)
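
The sampling idea from this issue (jump to evenly spaced offsets and read a small window at each) could be sketched like this. The function name and parameters are hypothetical, the handle must be opened in binary mode, and the detection modules themselves are left out.

```python
import io
import os


def sample_lines(handle, steps=10, fraction=0.01):
    """Yield complete lines from `steps` evenly spaced windows of a binary file."""
    size = handle.seek(0, os.SEEK_END)
    window = max(1, int(size * fraction))
    for step in range(steps):
        start = size * step // steps
        handle.seek(start)
        if start > 0:
            # Discard the line we landed in, since it is most likely partial.
            handle.readline()
        while handle.tell() < start + window:
            line = handle.readline()
            if not line:
                break
            yield line.rstrip(b"\r\n")


# In-memory stand-in for a large dictionary file.
data = io.BytesIO(b"".join(b"line%04d\n" % n for n in range(1000)))
sample = list(sample_lines(data))
print(len(sample), "of 1000 lines sampled")
```

Because only seeks and short reads are used, the cost depends on the sampled fraction rather than on the total file size.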

Modify/Remove modules, e.g. 'html', 'html-named' and 'remove-email', should have an 'Add' module equivalent

Not all modules have an 'Add' equivalent:

For example:

    --hex                           Replace lines like: $HEX[41424344] with ABCD.
    --html                          Replace lines like: &#351;ifreyok with şifreyok.
    --html-named                    Replace lines like: &#alpha; Those structures are more like passwords, so
                                    be careful to enable this option.
    --non-ascii                     Replace non ascii char with their replacement letters. For example ü
                                    becomes u, ç becomes c.

Remove modules (remove specific parts of a line):
    --remove-strip-punctuation      Remove starting and trailing punctuation
    --remove-punctuation            Remove all punctuation in a line
    --remove-email                  Enable email filter, this will catch strings like
                                    1238661:[email protected]:password

When creating dictionaries it's sometimes useful to keep both variants available.

Suggestions:

It could make sense to auto-generate the 'Add' equivalent in code.
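
A minimal sketch of auto-generating such an 'Add' variant: wrap any modify function so that the original line is kept and the modified line is added next to it. The wrapper name is hypothetical, and `html.unescape` from the standard library stands in for the `--html` module here.

```python
import html


def make_add_variant(modify):
    """Turn a modify function into one that keeps the original line as well."""
    def add_variant(line):
        changed = modify(line)
        # Only emit the modified line when it actually differs.
        return [line] if changed == line else [line, changed]
    return add_variant


add_html = make_add_variant(html.unescape)

print(add_html("&#351;ifreyok"))
print(add_html("plain"))
```

With a wrapper like this, every modify or remove module gets its 'Add' counterpart without duplicating the module's logic.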

Add phpass support to --hash detection

phpass hashes (prefix: $P$) are used within WordPress and are currently not detected by the --hash detection.

For example used in wordpress:
https://github.com/WordPress/wordpress-develop/blob/5.8.1/src/wp-includes/class-phpass.php#L132
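
A detection pattern could look roughly like the sketch below. It assumes the portable phpass format: the `$P$` prefix followed by one cost character, an 8-character salt and a 22-character digest, all drawn from the `./0-9A-Za-z` alphabet (34 characters in total). The `$H$` prefix, used by phpBB3 for the same scheme, is accepted as well. The function name and the sample value are made up and do not come from demeuk.

```python
import re

# Prefix + 31 characters: 1 cost character, 8 salt characters, 22 digest characters.
PHPASS = re.compile(r"\$[PH]\$[./0-9A-Za-z]{31}")


def is_phpass_hash(line):
    """Return True if the whole line has the shape of a portable phpass hash."""
    return PHPASS.fullmatch(line) is not None


# Syntactically valid placeholder, not a real hash.
example = "$P$B" + "saltsalt" + "x" * 22
print(is_phpass_hash(example))
print(is_phpass_hash("password"))
```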

Add pull request templates

Add a GitHub pull request template with the following checks:

add tests?
add docs?
bump version?

Empty lines in output dictionary with --remove-punctuation

What?
When demeuking a file with --remove-punctuation, lines that consist entirely of punctuation end up as multiple blank lines in the resulting dictionary.

Reproduction?

$ cat <<EOF > foo.dict
first-password
-
-+
-+-
-+-+
-+-+-
-+-+-+
-+-+-+-
-+-+-+-+
-+-+-+-+-
-+-+-+-+-+
-+-+-+-+-+-
-+-+-+-+-+-+
-+-+-+-+-+-+-
example
foobar
EOF

$ python3 demeuk.py -i foo.dict -o bar.dict --remove-punctuation

$ uniq -c bar.dict
      1 firstpassword
     13
      1 example
      1 foobar

Expects?
Empty lines (or duplicated entries) to be removed from results.
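
The expected behavior can be sketched as a filter that drops lines which become empty once punctuation is removed. This is an illustration only: the function name is hypothetical, and `string.punctuation` covers ASCII punctuation, which may be narrower than what demeuk treats as punctuation.

```python
import string

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def remove_punctuation(lines):
    """Yield each line without punctuation, skipping lines that end up empty."""
    for line in lines:
        cleaned = line.translate(STRIP_PUNCTUATION)
        if cleaned:
            yield cleaned


lines = ["first-password", "-", "-+", "-+-", "example", "foobar"]
print(list(remove_punctuation(lines)))
```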

Add RFC 2307 hash detection support to --hash

https://datatracker.ietf.org/doc/html/rfc2307

RFC 2307 (Experimental) suggests user passwords be hashed using a one-way (hopefully) cryptographically safe algorithm. They are often referred to as being "encrypted", but this is a misnomer (as they are not designed to be decrypted).

OpenLDAP supports RFC 2307 hashed passwords, including the {CRYPT}, {SSHA}, {SHA}, {SMD5}, {MD5}, and other schemes. Such passwords may be used as userPassword values and/or rootpw value.

Note: use of RFC 2307 Experimental passwords violates the Standard Track specification, RFC 2256, for user passwords and may lead to interoperability problems.

https://www.openldap.org/faq/data/cache/346.html
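
A first detection pass could match the `{SCHEME}` prefix followed by a non-empty value. The sketch below covers only the schemes named above; matching the scheme name case-insensitively is an assumption, and the function name is made up for the example.

```python
import base64
import hashlib
import re

# Scheme prefix in braces, followed by at least one non-whitespace character.
RFC2307 = re.compile(r"\{(CRYPT|SSHA|SHA|SMD5|MD5)\}\S+", re.IGNORECASE)


def is_rfc2307_hash(line):
    """Return True if the whole line looks like an RFC 2307 style hashed password."""
    return RFC2307.fullmatch(line) is not None


# Build a sample {SHA} value: the base64-encoded SHA-1 digest of the password.
digest = hashlib.sha1(b"password").digest()
example = "{SHA}" + base64.b64encode(digest).decode("ascii")
print(example, is_rfc2307_hash(example))
```

Since only the shape is checked, a password that itself starts with such a prefix would also match, so a stricter version might validate the base64 payload length per scheme.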