netherlandsforensicinstitute / demeuk
Demeuk is a simple tool to clean up corpora (like dictionaries) or any dataset containing plain text strings.
License: Apache License 2.0
When a line contains a hexed string and the split function is used, the result is wrong.
Example:
$cacheword$*1234567890abcdef1234567890abcdef*12345*1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef*1234567890abcdef12345678*1234567890abcdef1234567890abcdef:$HEX[4578616d706c653a24313233]
is added to the queue again as:
4578616d706c653a24313233
Which becomes: Example:$123
Split is then run on this line again, which results in $123 being added as a separate entry.
Solution:
UNHEX should be run after splitting, not before.
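A minimal sketch of the proposed ordering, assuming a single-character delimiter and the `$HEX[...]` notation from the example. The names `unhex` and `split_then_unhex` are hypothetical and are not demeuk's actual functions; the point is only that the decoded text is never fed back into the splitter.

```python
import re
from binascii import unhexlify

# Matches a whole field of the form $HEX[<even number of hex digits>].
HEX_RE = re.compile(r'^\$HEX\[((?:[0-9a-fA-F]{2})+)\]$')


def unhex(field):
    """Decode a $HEX[...] field; return any other field unchanged."""
    match = HEX_RE.match(field)
    if match is None:
        return field
    return unhexlify(match.group(1)).decode('utf-8', errors='replace')


def split_then_unhex(line, delimiter=':'):
    """Split first, then decode each field exactly once.

    The decoded text is not split again, so a delimiter hidden inside
    the hex payload stays part of the field.
    """
    return [unhex(part) for part in line.split(delimiter)]
```

With this ordering, `split_then_unhex('user:$HEX[4578616d706c653a24313233]')` keeps `Example:$123` together as one field.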
Currently, demeuk drops lines based on the length of the Unicode string. People tend to build passwords from repeating sequences.
This is a request for a feature that checks for such repeating patterns and still adds entries that do not meet the length requirement but consist of a repeating pattern.
Example:
passwordpasswordpasswordpassword
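One way such a check could work, as a sketch: find the smallest unit that the whole string is built from and apply the length limit to that unit instead. The function names below are hypothetical. The detection relies on the standard trick that a string repeats with period `k` exactly when it reappears inside its own doubling at offset `k`.

```python
def smallest_repeating_unit(text):
    """Return the shortest prefix that, repeated, yields `text`.

    For text without a repeating pattern the text itself is returned.
    """
    if not text:
        return text
    # The first position (after 0) where `text` occurs in text+text
    # is the length of the smallest repeating unit.
    period = (text + text).find(text, 1)
    return text[:period]


def effective_length(text):
    """Length of the repeating unit, usable in place of len(text)."""
    return len(smallest_repeating_unit(text))
```

For the example above, `smallest_repeating_unit('passwordpasswordpasswordpassword')` gives `'password'`, so the effective length is 8 rather than 32.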
Currently the processing of command-line arguments is a bit messy. Switching to something like click, or to a newer version of argparse, would make it easier to define default options clearly and to provide an error message stating which argument is missing.
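A small argparse sketch of what this could look like. The short flags `-i` and `-o` and the `--check-max-length` option appear elsewhere in these issues; the long names `--input` and `--output`, the help texts, and the `build_parser` helper are assumptions made for illustration only.

```python
import argparse


def build_parser():
    """Build a parser with explicit required arguments and defaults."""
    parser = argparse.ArgumentParser(
        prog='demeuk',
        description='Clean up corpora or other plain text datasets.',
    )
    # Long option names are illustrative assumptions.
    parser.add_argument('-i', '--input', required=True,
                        help='file to clean')
    parser.add_argument('-o', '--output', required=True,
                        help='file to write cleaned lines to')
    parser.add_argument('--check-max-length', type=int, default=None,
                        help='drop lines longer than this many characters')
    return parser
```

When a required argument is missing, argparse prints a usage message naming it and exits with status 2, which covers the error-message part of the request.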
The check-email module accepts [email protected]@Home
as a valid e-mail address, but of course it is not one.
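As a rough sketch of a stricter test, the candidate can be required to match as a whole and to contain exactly one `@`. The pattern and the helper name `is_plausible_email` are assumptions, not demeuk's current implementation, and the addresses in the usage note are made up.

```python
import re

# Exactly one '@': neither side may contain '@' or whitespace, and the
# domain must end in a dot followed by at least two letters.
EMAIL_RE = re.compile(r'[^@\s]+@[^@\s]+\.[A-Za-z]{2,}')


def is_plausible_email(candidate):
    """Return True only if the whole candidate looks like one address."""
    return EMAIL_RE.fullmatch(candidate) is not None
```

Here `is_plausible_email('user@example.com')` is true, while `is_plausible_email('user@example.com@Home')` is false because of the second `@`.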
Demeuk keeps gaining options. For newcomers I want a simple feature that scans a document and prints the cleaning options that apply to it.
The scan could jump to 0%, 10%, 20%, and so on, and scan 1% of the file at each position. That covers 10% of the file in total, which should be sufficient.
Goal: scanning a very large text file (say 10 GB) should take at most 1 minute. For each module that is hit, a small snippet of the matching lines should be shown.
So for example:
```
Check_hashes; found hashes; examples;
$h$7/uhfibmxg83yq6y1rh5y9wjee13kh.
Check_encoding; encoding decoding using 'utf-8'; examples
Check_controlchar; found control chars; example
\x07
Scanning done, run demeuk with:
demeuk.py -i -o --check-controlchar --check-hash --check-encoding
```
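The sampling idea (jump to every 10% mark and read about 1% there) could be sketched as follows. The function name `sample_lines` and its parameters are hypothetical. The file is opened in binary mode so that seeking to arbitrary byte offsets is safe, and a partial line at each jump target is skipped.

```python
import os


def sample_lines(path, steps=10, fraction=0.01):
    """Yield complete raw lines from `steps` evenly spaced samples.

    Each sample covers roughly `fraction` of the file size.
    """
    size = os.path.getsize(path)
    chunk_size = max(1, int(size * fraction))
    with open(path, 'rb') as handle:
        for step in range(steps):
            handle.seek(size * step // steps)
            if step > 0:
                # The offset may land mid-line: discard up to the next newline.
                handle.readline()
            # Read the sample, then finish the line the chunk ended in.
            data = handle.read(chunk_size) + handle.readline()
            for raw_line in data.splitlines():
                yield raw_line
```

The yielded byte strings can then be passed to each check module, collecting a few matching lines per module as examples for the report.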
Also, changing some default behavior:
Not all modules have an 'Add' equivalent.
For example:
--hex Replace lines like: $HEX[41424344] with ABCD.
--html Replace lines like: &#351;ifreyok with şifreyok.
--html-named Replace lines like: &#alpha; Those structures are more like passwords, so
be careful to enable this option.
--non-ascii Replace non ascii char with their replacement letters. For example ü
becomes u, ç becomes c.
Remove modules (remove specific parts of a line):
--remove-strip-punctuation Remove starting and trailing punctuation
--remove-punctuation Remove all punctuation in a line
--remove-email Enable email filter, this will catch strings like
1238661:[email protected]:password
When creating dictionaries it's sometimes useful to keep both variants available.
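A sketch of what an 'add' counterpart to a replace module might do: keep the original line and append the transformed variant only when it differs. The helper `with_variant` is hypothetical, and `strip_accents` uses the standard library's `unicodedata` as a stand-in for whatever transliteration demeuk actually applies.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after decomposition, e.g. 'ü' becomes 'u'."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def with_variant(line, transform):
    """Return the original line plus its transformed variant, if different."""
    changed = transform(line)
    if changed == line:
        return [line]
    return [line, changed]
```

For example, `with_variant('für', strip_accents)` keeps both `'für'` and `'fur'`, while a pure ASCII line yields a single entry.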
Suggestions:
phpass hashed (prefix:
Used in WordPress, for example:
https://github.com/WordPress/wordpress-develop/blob/5.8.1/src/wp-includes/class-phpass.php#L132
Add a GitHub template with the following checks:
What?
While demeuking a file with --remove-punctuation, lines that consist entirely of punctuation end up as multiple blank lines in the resulting dictionary.
Reproduction?
$ cat <<EOF > foo.dict
first-password
-
-+
-+-
-+-+
-+-+-
-+-+-+
-+-+-+-
-+-+-+-+
-+-+-+-+-
-+-+-+-+-+
-+-+-+-+-+-
-+-+-+-+-+-+
-+-+-+-+-+-+-
example
foobar
EOF
$ python3 demeuk.py -i foo.dict -o bar.dict --remove-punctuation
$ uniq -c bar.dict
1 firstpassword
13
1 example
1 foobar
Expects?
Empty lines (or duplicated entries) to be removed from results.
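The expected behavior could look roughly like the sketch below, which strips ASCII punctuation and then drops lines that became empty or that duplicate an earlier result. The function name is hypothetical, and `string.punctuation` is an assumption about which characters count as punctuation.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)


def remove_punctuation(lines):
    """Strip punctuation, skipping empty and duplicate results."""
    seen = set()
    result = []
    for line in lines:
        cleaned = line.translate(_PUNCTUATION_TABLE)
        if not cleaned.strip() or cleaned in seen:
            continue
        seen.add(cleaned)
        result.append(cleaned)
    return result
```

Applied to the reproduction file, the thirteen punctuation-only lines disappear and only `firstpassword`, `example`, and `foobar` remain.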
https://datatracker.ietf.org/doc/html/rfc2307
RFC 2307 (Experimental) suggests user passwords be hashed using a one-way (hopefully) cryptographically safe algorithm. They are often referred to as being "encrypted", but this is a misnomer (as they are not designed to be decrypted).
OpenLDAP supports RFC 2307 hashed passwords, including the {CRYPT}, {SSHA}, {SHA}, {SMD5}, {MD5}, and other schemes. Such passwords may be used as userPassword values and/or rootpw value.
Note: use of RFC 2307 Experimental passwords violates the Standard Track specification, RFC 2256, for user passwords and may lead to interoperability problems.
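A hash check for these values could start from the scheme prefix. The sketch below recognizes only the schemes named above; the pattern and the function name are assumptions, and other schemes that OpenLDAP supports are not covered.

```python
import re

# A scheme name in braces followed by a non-empty value without whitespace.
RFC2307_RE = re.compile(r'^\{(CRYPT|SSHA|SHA|SMD5|MD5)\}\S+$', re.IGNORECASE)


def looks_like_rfc2307(line):
    """Return True if the line looks like an RFC 2307 style hashed password."""
    return RFC2307_RE.match(line.strip()) is not None
```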
It depends on whether the word "max" is taken to include the given value.
My expectation would be that to drop lines longer than 16 characters, I pass --check-max-length 16. However, this drops every line longer than 15 characters.
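The inclusive reading would amount to a `<=` comparison, as in this sketch (the function name is hypothetical):

```python
def within_max_length(line, max_length):
    """Keep lines whose length is at most `max_length` (inclusive)."""
    return len(line) <= max_length
```

With a limit of 16, a 16-character line is kept and a 17-character line is dropped.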
hashes with a starting
As well as hashes of length 20.
Small idea: print out the status. The app already knows where it is (which chunk out of all chunks), so calculating the status is not difficult.
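A sketch of the calculation, assuming the number of finished chunks and the total number of chunks are available. The function name and the output format are made up for illustration.

```python
def format_progress(chunks_done, total_chunks):
    """Format progress as a percentage plus the raw chunk counts."""
    if total_chunks <= 0:
        raise ValueError('total_chunks must be positive')
    percent = 100 * chunks_done / total_chunks
    return f'{percent:.1f}% ({chunks_done}/{total_chunks} chunks)'
```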