bandrel / ocyara

Performs OCR on image files and scans them for matches to YARA rules

Home Page: https://pypi.python.org/pypi/OCyara/

License: GNU General Public License v3.0

Python 97.25% Dockerfile 2.75%
yara tesseract ocr optical-character-recognition python-3 python yara-rules tesseract-ocr-api

ocyara's Introduction

OCyara

The OCyara module performs OCR (Optical Character Recognition) on image files and scans them for matches to Yara rules. OCyara can also process images embedded in PDF files. For more information about Yara, visit https://virustotal.github.io/yara/.

Installation

Operating System Requirements

  • Python 3.5+

  • Debian-based Linux distros are currently the only supported operating systems. Installation has only been tested on Kali Rolling and Ubuntu 16.10. (Other Debian-based distros may work as well, but may require manual compilation of Tesseract and/or Leptonica to get support for all image types. GIF and TIFF library support seems to be troublesome with some Ubuntu LTS installations.)

  • Tesseract OCR API. To install Tesseract:

    1. apt-get update
    2. Install python3 header files: apt-get install python3-dev
    3. Install Tesseract and its required libraries: apt-get install tesseract-ocr libtesseract-dev libleptonica-dev libpng12-dev libjpeg62-dev libtiff5-dev zlib1g-dev

Install Procedure

The easiest way to install OCyara is through the use of pip:

  1. Ensure all the Operating System Requirements listed above have been met
  2. Run pip install cython (it has to be installed separately like this because tesserocr currently lacks an "install_requires")
  3. Run pip install ocyara

Along with OCyara, the following other packages will be automatically installed:

Usage

OCyara Class Usage Examples

# Scan the current directory recursively for files that match rules in
# "rulefile.yara"

from ocyara import OCyara

ocy = OCyara('./', recursive=True)
ocy.run('rulefile.yara', file_magic=True)
print(ocy.list_matches())

Returns:

Visa tests/Example.pdf
SSN tests/Example.pdf
American_Express tests/Example.pdf
Diners_Club tests/Example.pdf
JCB tests/Example.pdf
Discover tests/Example.pdf
credit_card tests/Example.pdf
MasterCard tests/Example.pdf
card tests/Example.pdf

Each printed line gives the name of the rule that was matched, followed by the file that matched it.
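When many rules match the same file, it can be easier to read the results grouped by file. The following is a minimal sketch that only post-processes text in the "rule file" line format shown above. The helper name group_rules_by_file is hypothetical and is not part of OCyara, and the sketch assumes rule names contain no spaces.

```python
from collections import defaultdict


def group_rules_by_file(output: str) -> dict:
    """Map each file name to the sorted list of rule names that matched it."""
    grouped = defaultdict(set)
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumption: the rule name contains no spaces, so the first
        # space separates it from the file name.
        rule, _, filename = line.partition(" ")
        grouped[filename.strip()].add(rule)
    return {name: sorted(rules) for name, rules in grouped.items()}


# A shortened copy of the sample output from the README.
sample = """Visa tests/Example.pdf
SSN tests/Example.pdf
credit_card tests/Example.pdf"""

print(group_rules_by_file(sample))
```

Grouping by file this way keeps the original output untouched and only changes how it is presented.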

CLI Usage Example

OCyara is not primarily intended to be used from the command line, but basic CLI capabilities have been implemented to allow easy testing of the library's core functionality.

usage: ocyara.py [-h] YARA_RULES_FILE TARGET_FILE/S

positional arguments:

  YARA_RULES_FILE  Path of file containing yara rules
  TARGET_FILE/S    Directory or file name of images to scan.

optional arguments:
  -h, --help       show this help message and exit
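For readers who want to wrap the library in their own script, an interface with the same two positional arguments could be declared with argparse roughly as follows. This is a sketch based only on the usage text above, not the project's actual CLI code. The function name build_parser and the attribute names yara_rules_file and target are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the two positional arguments shown above."""
    parser = argparse.ArgumentParser(prog="ocyara.py")
    parser.add_argument(
        "yara_rules_file",
        metavar="YARA_RULES_FILE",
        help="Path of file containing yara rules",
    )
    parser.add_argument(
        "target",
        metavar="TARGET_FILE/S",
        help="Directory or file name of images to scan.",
    )
    return parser


# Parse an example command line instead of sys.argv so the sketch is
# runnable anywhere.
args = build_parser().parse_args(["rulefile.yara", "./tests"])
print(args.yara_rules_file, args.target)
```

The parsed values could then be passed to the OCyara constructor and run() as in the class usage example.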

OCyara Class Structure

class OCyara(builtins.object)
 |  Performs OCR (Optical Character Recognition) on image files and scans for matches to Yara rules.
 |
 |  OCyara also can process images embedded in PDF files.
 |
 |  Methods defined here:
 |
 |  __call__(self)
 |      Default call which outputs the results with the same output standard as the regular yara program
 |
 |  __init__(self, path:str, recursive=False, worker_count=6, verbose=0) -> None
 |      Create an OCyara object that can scan the specified directory or file and store the results.
 |
 |      Arguments:
 |          path -- File or directory to be processed
 |
 |      Keyword Arguments:
 |          recursive -- Whether the specified path should be recursively searched for images (default False)
 |          worker_count -- The number of worker processes that should be spawned when
 |                          run() is executed (default available CPU cores * 2)
 |          verbose -- An int() from 0-2 that sets the verbosity level.
 |                     0 is default, 1 is information and 2 is debug
 |
 |  join(self, showprogress=True)
 |
 |  list_matched_rules(self) -> set
 |      Process the matchedfiles dictionary and return the set of rules that were matched.
 |
 |  list_matches(self, rules=None) -> typing.Dict
 |      List matched files and their contexts (if available) in dictionary form.
 |
 |      Keyword Arguments:
 |
 |          rules -- Accepts a string or list of strings indicating specific rules.
 |            Only matches pertaining to the specified rule/s will be returned. If no
 |            rules are specified, all matches will be returned.
 |
 |  run(self, yara_rule:str, auto_join=True, file_magic=False, save_context=False) -> None
 |      Begin multithreaded processing of path files with the specified rule file.
 |
 |      Arguments:
 |          yara_rule -- A string file path of a Yara rule file
 |
 |      Keyword Arguments:
 |          auto_join -- If set to True, the main process will stall until all the
 |            worker processes have completed their work. If set to False, join()
 |            must be manually called following run() to ensure the queue is
 |            cleared and all workers have terminated.
 |
 |          show_progress -- Display a progress bar when join() is used.
 |
 |          file_magic -- If file_magic is enabled, ocyara will examine the contents
 |            of the target files to determine if they are an eligible image file
 |            type. For example, a JPEG file named 'picture.txt' will be processed by
 |            the OCR engine. file_magic uses the Linux "file" command.
 |
 |          save_context -- If True, when a file matches a yara rule, the returned
 |            results dictionary will also include the full ocr text of the matched
 |            file. This text can be further processed by the user if needed.
 |
 |  show_progress(self) -> None
 |      Generate a progress bar based on the number of items remaining in queue.
 |
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |
 |  check_file_type(path:str) -> str
 |      Use the Linux "file" command to determine a file's type based on contents
 |      instead of file extension.
 |
 |      Arguments:
 |          path -- A string file path to be processed
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
 |
 |  yara_output
 |      Returns the same output format as the standard yara program:
 |      RuleName FileName, FileName
 |      RuleName FileName...
 |
 |      Where:
 |        RuleName is the name of the rule that was matched
 |        FileName is the name of the file in which the match was found
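The file_magic option and check_file_type above describe detecting a file's type from its contents rather than its extension, using the Linux "file" command. The idea can be illustrated without that command by comparing a file's leading bytes against known signatures. This is a swapped-in technique for illustration only, not how OCyara does it. The names SIGNATURES and sniff_file_type are hypothetical.

```python
import os
import tempfile

# Leading byte signatures for a few common image formats and PDF.
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"II*\x00": "tiff",  # little-endian TIFF
    b"MM\x00*": "tiff",  # big-endian TIFF
    b"%PDF-": "pdf",
}


def sniff_file_type(path: str) -> str:
    """Guess a file's type from its first bytes, ignoring the extension."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    for signature, kind in SIGNATURES.items():
        if header.startswith(signature):
            return kind
    return "unknown"


# A JPEG header saved under a misleading ".txt" name is still detected.
with tempfile.TemporaryDirectory() as tmp:
    disguised = os.path.join(tmp, "picture.txt")
    with open(disguised, "wb") as handle:
        handle.write(b"\xff\xd8\xff\xe0" + b"\x00" * 12)
    detected = sniff_file_type(disguised)

print(detected)
```

A signature table like this covers far fewer formats than the "file" command, so it is only suitable as an illustration of the principle.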

ocyara's People

Contributors

bandrel, quietimcoding, ryman1

ocyara's Issues

Syntax Error when running test_example_pdf.py

Please excuse my newb question, but I am trying to run test_example_pdf.py and I'm getting a syntax error. The issue is shown in the attached PNG. I can't seem to figure out what I'm doing wrong. I'm running this on the Kali Rolling distro, just updated.

results match sample

When a yara rule is matched, we should store the contents of the match (e.g. the phone number that matches the yara regex).
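Until matched content is stored by the library, one possible workaround is to re-scan the OCR text (available when save_context is enabled) with an equivalent regular expression. The sketch below uses Python's re module rather than yara, and the phone-number pattern, the sample text, and the helper name extract_matches are all hypothetical.

```python
import re

# Hypothetical pattern for phone numbers such as 555-123-4567.
PHONE_PATTERN = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")


def extract_matches(ocr_text, pattern):
    """Return every substring of ocr_text that matches the compiled pattern."""
    return [match.group(0) for match in pattern.finditer(ocr_text)]


# Stand-in for OCR text returned alongside a rule match.
sample_text = "Call 555-123-4567 or 555.987.6543 before noon."
found = extract_matches(sample_text, PHONE_PATTERN)
print(found)
```

The regular expression has to be kept in sync with the yara rule by hand, which is the main drawback of this approach.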

reuse ocyara object

Add the ability to reset the object and/or execute run() again without needing to create a whole separate object. Currently, if run() is executed a second time, it just causes hung workers to start, but nothing happens because the queue is not repopulated.
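The behaviour requested here can be illustrated with a toy model in which run() rebuilds its work queue on every call. This is a hypothetical sketch using threads and the standard queue module. ToyScanner is not OCyara's implementation, and the "scanning" step is a placeholder.

```python
import queue
import threading


class ToyScanner:
    """Hypothetical stand-in for a queue-driven scanner; not OCyara's code."""

    def __init__(self, paths, worker_count=2):
        self._paths = list(paths)
        self._worker_count = worker_count
        self._lock = threading.Lock()
        self.results = []

    def run(self):
        # Build a fresh queue and result list on every call so that a
        # second run() has work to hand to its workers.
        pending = queue.Queue()
        for path in self._paths:
            pending.put(path)
        self.results = []
        workers = [
            threading.Thread(target=self._work, args=(pending,))
            for _ in range(self._worker_count)
        ]
        for worker in workers:
            worker.start()
        for worker in workers:
            worker.join()

    def _work(self, pending):
        while True:
            try:
                path = pending.get_nowait()
            except queue.Empty:
                return  # Nothing left to do, so the worker exits cleanly.
            with self._lock:
                self.results.append(path.upper())  # Placeholder for real scanning.


scanner = ToyScanner(["a.png", "b.png"])
scanner.run()
first = sorted(scanner.results)
scanner.run()  # Re-running works because the queue was repopulated.
second = sorted(scanner.results)
print(first == second)
```

Because workers exit when the queue is empty instead of blocking, a repeated run() neither hangs nor reuses stale results in this model.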

python setup.py install broken

Running setup.py install directly is currently broken.

Use pip install ocyara to install from the package. We will keep PyPI up to date with master, but for the development branch we recommend using virtualenv and not installing the package for now.

update setup.py

Update setup.py to indicate the existence of prerequisites to users who may be installing from PyPI without referencing the GitHub page.
