⠀⠀⠀⠀⠀⠀⠀⣤⣤⣄⣀⡀⠀⠀⠀⢀⣠⣤⣤⣄⡀⠀⠀⠀⢀⣀⣠⣤⣤⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠸⣿⣿⡿⠿⢿⣷⡄⢠⣿⣿⣿⣿⣿⣿⡄⢀⣾⡿⠿⢿⣿⣿⠇⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠈⠉⠀⠀⢸⣿⡇⢸⣿⣿⣿⣿⣿⣿⡇⢸⣿⡇⠀⠀⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⣠⣤⡀⠀⠀⠀⠀⠀⠀⠀⢸⣿⡇⢸⣿⣿⣿⣿⣿⣿⡇⢸⣿⡇⠀⠀⠀⠀⠀⠀⠀⢀⣤⣄⠀⠀⠀
⠸⣿⣿⣿⣿⣿⣿⣿⣿⣦⠀⢸⣿⡇⢸⣿⣿⣿⣿⣿⣿⡇⢸⣿⡇⠀⣴⣿⣿⣿⣿⣿⣿⣿⣿⠇⠀⠀
⠀⠉⠉⠁⠀⠀⠀⠀⣿⣿⠀⢸⣿⡇⠀⠉⣿⣿⣿⣿⠉⠀⢸⣿⡇⠀⣿⣿⠀⠀⠀⠀⠈⠉⠉⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⣿⣿⣀⣈⣻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣟⣁⣀⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠘⠿⠿⠿⠿⠿⣿⣿⣿⣿⣿⣿⣿⣿⠿⠿⠿⠿⠿⠃⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢀⣤⣤⣤⣤⣤⣤⣴⣿⣿⣿⡇⢸⣿⡿⣿⣦⣤⣤⣤⣤⣤⣤⡀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢸⣿⠋⠉⠉⠉⠉⠉⠉⢸⣿⡇⢸⣿⡇⠈⠉⠉⠉⠉⠉⠙⣿⣧⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⢰⣿⣿⣦⠀⢰⣿⣿⣦⠀⢸⣿⡇⢸⣿⡇⠀⣰⣿⣿⡆⠀⣴⣿⣿⡆⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠈⠻⠿⠋⠀⠘⣿⣿⠃⠀⢸⣿⡇⢸⣿⡇⠀⠘⣿⣿⠃⠀⠙⠿⠟⠁⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢻⣿⣦⣤⣼⣿⠃⠘⣿⣧⣄⣤⣿⡟⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠛⠛⠛⠁⠀⠀⠈⠛⠛⠛⠋⠀⠀⠀
⠀⠀⠀⠀⠀⠀ ⠀O C T O P I I⠀⠀⠀⠀
Copyright © 2023 RedHunt Labs Private Limited
Octopii is a Personally Identifiable Information (PII) scanner that uses Optical Character Recognition (OCR), regular expression lists and Natural Language Processing (NLP) to search public-facing locations for government IDs, addresses, emails and other PII in images, PDFs and documents.
PII leaks are often overlooked in the cybersecurity space. At RedHunt Labs, we always look for different and innovative ways to come up with cybersecurity solutions that organizations and services need. We've encountered a substantial number of organizations that have their servers configured incorrectly. As a result, employee and customer PII leaks frequently, giving malicious parties sensitive information about those individuals' origins, ID numbers, contact information and location.
This is why we created Octopii, a tool that detects leaked PII and demonstrates how easy it is to automate the discovery and extraction of leaked PII and sensitive documents on the Internet.
- Install all dependencies via
  pip install -r requirements.txt
- Install the Tesseract helper locally via
  sudo apt install tesseract-ocr -y
  on Ubuntu, or
  sudo pacman -Syu tesseract
  on Arch Linux.
- Install spaCy language definitions locally via
  python -m spacy download en_core_web_sm
Once you've installed the above, you're all set.
To run Octopii, type
python3 octopii.py <location to scan>
where <location to scan> is a file or a directory.
Octopii currently supports local scanning via filesystem path, S3 URLs and Apache open directory listings. You can also provide individual image URLs or files as an argument.
We've provided a dummy-pii/ folder containing sample PII for you to test Octopii with. Pass it as an argument and you'll get the following output:
owais@artemis ~ $ python3 octopii.py dummy-pii/
Searching for PII in dummy-pii/dummy-drivers-license-nebraska-us.jpg
{
    "file_path": "dummy-pii/dummy-drivers-license-nebraska-us.jpg",
    "pii_class": "Nebraska Driver's License",
    "country_of_origin": "United States",
    "faces": 1,
    "identifiers": [],
    "emails": [],
    "phone_numbers": [
        "4000002170"
    ],
    "addresses": [
        "Nebraska"
    ]
}
Searching for PII in dummy-pii/dummy-PAN-India.jpg
{
    "file_path": "dummy-pii/dummy-PAN-India.jpg",
    "pii_class": "Permanent Account Number",
    "country_of_origin": "India",
    "faces": 0,
    "identifiers": [],
    "emails": [],
    "phone_numbers": [],
    "addresses": [
        "INDIA"
    ]
}
...
A file named output.txt is created, containing output from the tool. This file is appended to sequentially in real time.
Octopii uses Tesseract for Optical Character Recognition (OCR) and NLTK for Natural Language Processing (NLP) to detect strings of personally identifiable information. This is done via the following steps:
Octopii scans for images (jpg and png) and documents (pdf, doc, txt etc). It supports 3 sources:
- Amazon Simple Storage Service (S3): traverses the XML from S3 container URLs
- Open directory listing: traverses Apache open directory listings and scans for files
- Local filesystem: can access files and folders within UNIX-like filesystems (macOS and Linux-based operating systems)
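As a rough illustration of the S3 source, the sketch below pulls object keys out of a bucket listing document and turns them into file URLs. It is not Octopii's own code: the function name `list_s3_file_urls`, the sample XML and the example bucket URL are assumptions made for this example, and the XML is supplied as a string so that no network access is needed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# XML namespace used by S3 ListBucketResult documents.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

# Hypothetical listing, shaped like the XML a public bucket URL returns.
SAMPLE_LISTING = """
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <Contents><Key>ids/license.jpg</Key><Size>1024</Size></Contents>
  <Contents><Key>docs/</Key><Size>0</Size></Contents>
  <Contents><Key>docs/form.pdf</Key><Size>2048</Size></Contents>
</ListBucketResult>
""".strip()


def list_s3_file_urls(listing_xml: str, base_url: str) -> list[str]:
    """Return one URL per object key in the listing, skipping folder markers."""
    root = ET.fromstring(listing_xml)
    urls = []
    for key in root.findall("s3:Contents/s3:Key", S3_NS):
        name = key.text or ""
        # Keys ending in "/" are folder placeholders, not files.
        if not name or name.endswith("/"):
            continue
        urls.append(base_url.rstrip("/") + "/" + quote(name))
    return urls


if __name__ == "__main__":
    for url in list_s3_file_urls(SAMPLE_LISTING, "https://example-bucket.s3.amazonaws.com/"):
        print(url)
```

Each returned URL could then be downloaded and scanned like any other individual file.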
Images are detected via Python Imaging Library (PIL) and are opened with OpenCV. PDFs are converted into a list of images and are scanned via OCR. Text-based file types are read into strings and are scanned without OCR.
A binary classification image detection technique - known as a "Haar cascade" - is used to detect faces within images. A pre-trained cascade model is supplied in this repo, which contains cascade data for OpenCV to use. Multiple faces can be detected within the same PII image, and the number of faces detected is returned.
Images are then "cleaned" for text extraction with the following image transformation steps:
- Auto-rotation
- Grayscaling
- Monochrome
- Mean threshold
- Gaussian threshold
- 3x Deskewing
Since these steps strip away image data (including colors in photographs), this image cleaning process occurs after attempting face detection.
Tesseract is used to extract all text strings from an image or file. The text is then tokenized into a list of strings, split on newline characters ('\n') and spaces (' '). Garbled text, such as null strings and single characters, is discarded from this list, resulting in an 'intelligible' list of potential words.
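The tokenization step described above can be sketched in a few lines. This is an illustrative version only: the function name `tokenize` is an assumption, and the real implementation may filter garbled text differently.

```python
def tokenize(ocr_text: str) -> list[str]:
    """Split OCR output on newlines and spaces, dropping empty and 1-character tokens."""
    words = []
    for line in ocr_text.split("\n"):
        for token in line.split(" "):
            token = token.strip()
            # Empty strings and single characters are treated as OCR noise.
            if len(token) > 1:
                words.append(token)
    return words


if __name__ == "__main__":
    print(tokenize("NEBRASKA  DRIVER\nLICENSE a \n\n4d DLN"))
```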
This list of words is then fed into a similarity checker function. This function uses Gestalt pattern matching to compare each word extracted from the PII document with a list of keywords present in definitions.json. This check happens once per cleaning step. The number of extracted words that match a definition's keywords is counted, and this count is used to derive a confidence score. The definition whose keywords appear most often across these scans gets the highest score and is picked as the predicted PII class.
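To make the scoring idea concrete, here is a minimal sketch using `difflib.SequenceMatcher` from the Python standard library, whose ratio is based on Ratcliff/Obershelp "gestalt pattern matching". The `DEFINITIONS` dictionary is a hypothetical stand-in for definitions.json, and the 0.8 threshold and function names are assumptions. The sketch scores a single word list, whereas the text above describes repeating the check once per cleaning step.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for definitions.json: PII class -> keywords.
DEFINITIONS = {
    "Driver's License": ["driver", "license", "class", "restrictions"],
    "Passport": ["passport", "nationality", "surname"],
}


def is_similar(word: str, keyword: str, threshold: float = 0.8) -> bool:
    """True when the gestalt similarity ratio reaches the (assumed) threshold."""
    return SequenceMatcher(None, word.lower(), keyword.lower()).ratio() >= threshold


def score_definitions(words, definitions=DEFINITIONS):
    """Count, per PII class, how many extracted words resemble one of its keywords."""
    return {
        pii_class: sum(
            1 for word in words if any(is_similar(word, kw) for kw in keywords)
        )
        for pii_class, keywords in definitions.items()
    }


def predict_class(words, definitions=DEFINITIONS):
    """Pick the class with the highest count, or None if nothing matched."""
    scores = score_definitions(words, definitions)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    ocr_words = ["NEBRASKA", "DRIVER", "LICENCE", "Class", "Nationalty"]
    print(score_definitions(ocr_words))
    print(predict_class(ocr_words))
```

Fuzzy matching is useful here because OCR output often contains near-misses such as "LICENCE" or "Nationalty", which an exact string comparison would not count.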
Octopii also checks for sensitive PII substrings such as emails, phone numbers and common government ID unique identifiers using regular expressions. It can also extract geolocation data such as addresses and countries using Natural Language Processing.
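As an illustration of the regular-expression step, the sketch below extracts email addresses and 10-digit phone numbers from a block of text. The two patterns and the function name `find_contacts` are simplified assumptions for this example; they are not the rules Octopii ships with.

```python
import re

# Illustrative patterns only; real-world rule sets are broader and region-specific.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Exactly 10 digits, not embedded in a longer run of digits.
PHONE_RE = re.compile(r"(?<!\d)\d{10}(?!\d)")


def find_contacts(text: str) -> dict:
    """Return de-duplicated, sorted email addresses and phone numbers found in text."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "phone_numbers": sorted(set(PHONE_RE.findall(text))),
    }


if __name__ == "__main__":
    sample = "Contact: jane.doe@example.com or 4000002170. Ref 123456789012."
    print(find_contacts(sample))
```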
The output consists of the following:
- file_path: Where the file containing PII can be found
- pii_class: The type of PII this file contains
- country_of_origin: Where this PII originates from
- identifiers: Unique identifiers, codes or numbers that may be used to target the individual mentioned in the PII
- emails and phone_numbers: Contact information in the file
- addresses: Any form of geolocation data in the PII. This may be used to triangulate an individual's location.
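A result with these fields can be consumed by any JSON-aware script. The sketch below parses one result object, copied from the sample output earlier in this README, and reports which list fields are non-empty. The helper name `populated_fields` is an assumption made for this example.

```python
import json

# One result object, copied from the sample output shown earlier.
RESULT_JSON = """
{
    "file_path": "dummy-pii/dummy-drivers-license-nebraska-us.jpg",
    "pii_class": "Nebraska Driver's License",
    "country_of_origin": "United States",
    "faces": 1,
    "identifiers": [],
    "emails": [],
    "phone_numbers": ["4000002170"],
    "addresses": ["Nebraska"]
}
"""

LIST_FIELDS = ("identifiers", "emails", "phone_numbers", "addresses")


def populated_fields(result: dict) -> list[str]:
    """Return the names of the list fields that contain at least one value."""
    return [field for field in LIST_FIELDS if result.get(field)]


if __name__ == "__main__":
    result = json.loads(RESULT_JSON)
    print(result["pii_class"], populated_fields(result))
```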
Click here to read about how you can contribute to Octopii.
This tool is intended for research and educational purposes only. RedHunt Labs and other contributors to this project take no responsibility for malicious usage of this tool.
By Owais Shaikh
- Work: [email protected]
- Personal: [email protected]
octopii's Issues
python3 octopii.py
Traceback (most recent call last):
  File "/home/kali/Octopii/octopii.py", line 33, in <module>
    from keras.models import load_model
ModuleNotFoundError: No module named 'keras'
Feature request: portable app
Would there be an easy way to make this portable so I could toss it on a thumb drive and run it on a random workstation?
Windows
Is your feature request related to a problem? Please describe.
My use case is scanning backup files on an SMB share with Octopii. This works today, but it takes extra steps: I have to make sure my Linux machine has access to the SMB share or the backup file in question. Enabling Octopii to run on Windows as well would help this use case.
Describe the solution you'd like
I am not sure how big this lift is, more than happy to help where possible. I have added the errors below that I see after confirming that the dependencies for windows are available.
It is not the end of the world but being able to run this from a Windows box would be better than having a dedicated Linux box for this task.
Additional context
When I run on Windows where I have already installed Tesseract I get the following:
Octopii python .\octopii.py .\dummy-pii\
Traceback (most recent call last):
File "C:\Users\Administrator\Documents\Octopii\octopii.py", line 123, in <module>
rules=text_utils.get_regexes()
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Administrator\Documents\Octopii\text_utils.py", line 52, in get_regexes
_rules = json.load(json_file)
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\json\__init__.py", line 293, in load
return loads(fp.read(),
^^^^^^^^^
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3062: character maps to <undefined>
Octopii crashes on empty files
Describe the bug
A clear and concise description of what the bug is.
To Reproduce
Steps to reproduce the behavior:
Run octopii against a folder with a 0 byte file in it
Traceback (most recent call last):
  File "/opt/Octopii/octopii.py", line 199, in <module>
    results = search_pii (file_path)
  File "/opt/Octopii/octopii.py", line 80, in search_pii
    addresses = text_utils.regional_pii(text)
  File "/opt/Octopii/text_utils.py", line 80, in regional_pii
    place_entity = locationtagger.find_locations(text = text)
  File "/usr/local/lib/python3.10/dist-packages/locationtagger/__init__.py", line 4, in find_locations
    e = NamedEntityExtractor(url=url, text=text)
  File "/usr/local/lib/python3.10/dist-packages/locationtagger/locationextractor.py", line 25, in __init__
    raise Exception('Please input any text or url')
Exception: Please input any text or url
Expected behavior
Octopii should not crash when a file is 0 bytes.
ModuleNotFoundError: No module named 'cv2'
Describe the bug
ModuleNotFoundError: No module named 'cv2'
To Reproduce
Steps to reproduce the behavior:
- Run python3 octopii.py dummy-pii/ (Windows 11)
Expected behavior
Octopii runs successfully
UnboundLocalError: local variable 'contains_faces' referenced before assignment
Describe the bug
When running the tool on a directory without images or PDF files, an UnboundLocalError is raised because the variable contains_faces has not been initialized. I believe that adding contains_faces = 0 at the beginning of the search_pii(file_path) function will solve the issue.
To Reproduce
Steps to reproduce the behavior:
- Create a directory dir containing only text files
- Run python3 octopii.py dir/
Expected behavior
Octopii runs successfully
"WARNING:tensorFlow:No training configuration found..." When running tool
Greetings,
I became aware of this project via Intigriti's Bug Bytes newsletter. I went through the install using venv, but found that the following error is returned when I run the tool against the 'dummy-pii' local directory and the 'https://pii-carbonconsole.fra1.digitaloceanspaces.com' URL.
It seems to be working as expected, as it returns a confidence value for the sample images containing "PII". I am running the tool on Kali 2022.1 with Python 3.9.12 inside a virtualenv created with venv. A GitHub issue for another project led me to add ", compile=False" to line 214 of the octopii.py script.
I don't really understand the implications of the change, but it did result in the error no longer being returned. As I mentioned earlier, the tool seems to be working as expected, so to me it kind of seems like it is just "cosmetic".
This is an exciting project. Thank you for the time and effort put into developing it and sharing it with the world!
New PII-related regexes
Is your feature request related to a problem? Please describe.
I believe we can have more regexes for PII scanning. This can help expand the coverage of the tool.
Describe the solution you'd like
I discovered a website that has a good amount of regexes that I believe can be useful for Octopii: https://docs.trellix.com/bundle/data-loss-prevention-11.10.x-classification-definitions-reference-guide/page/GUID-66B1F12A-E267-4EEB-A9A5-A4398A6AF8CD.html
Additional context
None
Fails and crashes when encountering unfamiliar file: zip, .db, etc
Questions about "confidence_score"
Hi, I've been reading your source code and am curious about the confidence scores.
Could you explain what the scores indicate and what criteria or thresholds you use for them?