yargen's Introduction

Actively Maintained

yarGen

                   _____
    __ _____ _____/ ___/__ ___
   / // / _ `/ __/ (_ / -_) _ \
   \_, /\_,_/_/  \___/\__/_//_/
  /___/  Yara Rule Generator
         Florian Roth, July 2020, Version 0.23.2

  Note: Rules have to be post-processed
  See this post for details: https://medium.com/@cyb3rops/121d29322282

What does yarGen do?

yarGen is a generator for YARA rules

The main principle is the creation of YARA rules from strings found in malware files while removing all strings that also appear in goodware files. To that end, yarGen includes a large goodware strings and opcode database as ZIP archives that have to be extracted before first use.
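The principle can be illustrated with a minimal sketch. This is not yarGen's actual code: the function names, the sample bytes and the tiny goodware set are made up for illustration, and only printable ASCII strings are considered.

```python
import re


def extract_ascii_strings(data, min_len=8):
    """Return the set of printable ASCII runs with at least min_len characters."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return {match.decode("ascii") for match in pattern.findall(data)}


def candidate_strings(sample, goodware, min_len=8):
    """Keep only the strings that never appear in the goodware set."""
    return extract_ascii_strings(sample, min_len) - goodware


# Hypothetical sample content and goodware set
sample = b"\x00\x01kernel32.dll\x00\x02evil_c2_beacon_v1\x00short\x00"
goodware = {"kernel32.dll"}

print(candidate_strings(sample, goodware))
```

Here "short" is dropped because of the minimum length, "kernel32.dll" because it is known from goodware, and only the remaining string is left as a rule candidate.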

In version 0.24.0, yarGen introduces an output option (--ai). This feature generates a YARA rule with an expanded set of strings and includes instructions tailored for an AI. I suggest employing ChatGPT Plus with model 4 to refine these rules. Activating the --ai flag appends the instruction text to the yargen_rules.yar output file, which can subsequently be fed into your AI for processing.

With version 0.23.0 yarGen has been ported to Python3. If you'd like to use a version using Python 2, try a previous release. (Note that the download location for the pre-built databases has changed, since the database format has been changed from the outdated pickle to json. The old databases are still available but in an old location on our web server only used in the old yarGen version <0.23)

Since version 0.12.0 yarGen does not completely remove the goodware strings from the analysis process but includes them with a very low score depending on the number of occurrences in goodware samples. They are included only if no better strings can be found and are marked with the comment /* Goodware rule */. Force yarGen to remove all goodware strings with --excludegood. Also since version 0.12.0, yarGen allows you to place the "strings.xml" from PEstudio in the program directory in order to apply the blacklist definition during the string analysis process. You'll get better results.
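A rough sketch of such a scoring scheme follows. The base score and the penalty formula are assumptions chosen for illustration; yarGen's real scoring is more involved and is not reproduced here.

```python
def score_string(string, goodware_counts, base_score=5, exclude_good=False):
    """Return a score for the string, or None if it is dropped entirely."""
    count = goodware_counts.get(string, 0)
    if count == 0:
        return base_score  # never seen in goodware
    if exclude_good:
        return None        # mimics --excludegood: remove the string completely
    return -count          # hypothetical penalty: more goodware hits, lower score


# Hypothetical goodware occurrence counts
goodware_counts = {"GetProcAddress": 1200, "Mozilla/5.0": 40}

for s in ("evil_mutex_name", "Mozilla/5.0", "GetProcAddress"):
    print(s, score_string(s, goodware_counts))
```

With a scheme like this, goodware strings stay available as a fallback but always rank below strings that were never seen in goodware.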

Since version 0.14.0 it uses naive-bayes-classifier by Mustafa Atik and Nejdet Yucesoy in order to classify the string and detect useful words instead of compression/encryption garbage.

Since version 0.15.0 yarGen supports opcode elements extracted from the .text sections of PE files. During database creation it splits the .text sections with the regex [\x00]{3,} and takes the first 16 bytes of each part to build an opcode database from goodware PE files. During rule creation on sample files it compares the goodware opcodes with the opcodes extracted from the malware samples and removes all opcodes that also appear in the goodware database. (There is no further magic in it yet: no XOR loop detection etc.) The option to activate opcode integration is '--opcodes'.
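The splitting step can be sketched as follows. The function name is hypothetical, the input is a hand-made byte string instead of a real .text section, and skipping parts shorter than 8 bytes is an assumption.

```python
import re


def opcodes_from_text_section(text, length=16, min_len=8):
    """Split section data on runs of 3+ null bytes and hex-encode the head of each part."""
    opcodes = []
    for part in re.split(rb"[\x00]{3,}", text):
        if len(part) < min_len:  # assumption: very short parts are ignored
            continue
        opcodes.append(part[:length].hex())
    return opcodes


# Hand-made stand-in for the bytes of a .text section
part_a = bytes(range(1, 21))        # 20 bytes, only the first 16 are kept
part_b = b"\x90" * 5                # too short, skipped
part_c = bytes(range(0x40, 0x4A))   # 10 bytes, kept completely
text = part_a + b"\x00" * 4 + part_b + b"\x00" * 3 + part_c

sample_opcodes = opcodes_from_text_section(text)

# Remove everything that is also known from goodware (hypothetical database content)
goodware_opcodes = {part_c.hex()}
unique_opcodes = [op for op in sample_opcodes if op not in goodware_opcodes]
print(unique_opcodes)
```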

Since version 0.17.0 yarGen allows creating multiple databases for opcodes and strings. You can now easily create a new database by using "-c" and an identifier "-i identifier" e.g. "office". It will then create two new database files named "good-strings-office.db" and "good-opcodes-office.db" that will be initialized during startup with the built-in databases.

Since version 0.18.0 yarGen supports extra conditions that make use of the pe module. This includes imphash values and the PE file's exports. We provide pre-generated imphash and export databases.

Since version 0.19.0 yarGen supports a 'dropzone' mode in which it initializes all strings/opcodes/imphashes/exports only once and monitors a given folder for new samples. If it finds new samples dropped to the folder, it creates rules for these samples, writes the YARA rules to the defined output file (default: yargen_rules.yar) and removes the dropped samples. You can specify a text file (-b) from which the identifier is read. The reference parameter (-r) has also been extended so that it can be a text file on disk from which the reference is read. E.g. drop two files named 'identifier.txt' and 'reference.txt' together with the samples to the folder and use the parameters -b ./dropzone/identifier.txt and -r ./dropzone/reference.txt to read the respective strings from the files each time an analysis starts.
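A single pass of such a dropzone loop could look roughly like the sketch below. The function, its parameters and the rule callback are hypothetical; a real loop would call it repeatedly with a pause in between, and how yarGen itself treats the identifier and reference files inside the folder is not covered here.

```python
import os
import tempfile


def process_dropzone_once(folder, output_file, make_rule, skip=()):
    """Process every file currently in the folder once and return the number handled."""
    handled = 0
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path) or name in skip:
            continue
        with open(path, "rb") as fh:
            rule_text = make_rule(name, fh.read())
        with open(output_file, "a") as out:
            out.write(rule_text + "\n")
        os.remove(path)  # processed samples are deleted
        handled += 1
    return handled


# Demonstration in a temporary directory with a dummy rule generator
with tempfile.TemporaryDirectory() as tmp:
    zone = os.path.join(tmp, "dropzone")
    os.mkdir(zone)
    with open(os.path.join(zone, "sample.bin"), "wb") as fh:
        fh.write(b"MZ dummy content")
    out_path = os.path.join(tmp, "yargen_rules.yar")

    count = process_dropzone_once(
        zone, out_path, lambda name, data: "// rule for %s (%d bytes)" % (name, len(data))
    )
    remaining = os.listdir(zone)
    with open(out_path) as fh:
        written = fh.read()

print(count, remaining, written.strip())
```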

Since version 0.20.0 yarGen supports the extraction and use of hex encoded strings that often appear in weaponized RTF files.
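The idea can be sketched like this. The regular expressions and length thresholds are assumptions for illustration and do not mirror yarGen's implementation.

```python
import re

# Assumption: a run of at least 8 hex-encoded bytes is worth decoding
HEX_RUN = re.compile(rb"(?:[0-9A-Fa-f]{2}){8,}")


def extract_hex_encoded_strings(data, min_len=6):
    """Decode long hex runs and return the printable ASCII strings hidden in them."""
    results = []
    for run in HEX_RUN.findall(data):
        decoded = bytes.fromhex(run.decode("ascii"))
        for found in re.findall(rb"[\x20-\x7e]{%d,}" % min_len, decoded):
            results.append(found.decode("ascii"))
    return results


# Hypothetical RTF fragment with a hex-encoded command line
rtf = b"{\\rtf1 " + b"cmd.exe /c calc".hex().encode("ascii") + b"}"
print(extract_hex_encoded_strings(rtf))
```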

The rule generation process also tries to identify similarities between the analyzed files and then combines the shared strings into so-called super rules. The super rule generation does not remove the simple rules for the files that have been combined in a single super rule, so there is some redundancy when super rules are created. You can suppress the simple rule for a file that is already covered by a super rule by using --nosimple.
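One simple way to find super rule candidates is to group strings by the exact set of files they occur in. The sketch below follows that approach under stated assumptions: the function name and file names are made up, and yarGen's real grouping logic may differ.

```python
from collections import defaultdict


def find_super_rules(file_strings, min_overlap=5):
    """Map each combination of 2+ files to the strings shared by exactly those files."""
    files_by_string = defaultdict(set)
    for file_name, strings in file_strings.items():
        for s in strings:
            files_by_string[s].add(file_name)

    strings_by_combo = defaultdict(set)
    for s, files in files_by_string.items():
        if len(files) > 1:
            strings_by_combo[frozenset(files)].add(s)

    # Keep only combinations with enough overlapping strings (compare -w)
    return {combo: strs for combo, strs in strings_by_combo.items() if len(strs) >= min_overlap}


shared = {"shared_%d" % i for i in range(1, 6)}
file_strings = {
    "a.exe": shared | {"only_a"},
    "b.exe": shared | {"only_b"},
    "c.exe": {"only_c"},
}

super_rules = find_super_rules(file_strings)
print(super_rules)
```

In this example a.exe and b.exe share five strings and form one super rule candidate, while c.exe only gets a simple rule.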

Installation

  1. Make sure you have at least 4GB of RAM on the machine you plan to use yarGen (8GB if opcodes are included in rule generation, use with --opcodes)
  2. Install all dependencies with pip install -r requirements.txt (or pip3 install -r requirements.txt)
  3. Run python yarGen.py --update to automatically download the built-in databases. They are saved into the './dbs' sub folder. (Download: 913 MB)
  4. See help with python yarGen.py --help for more information on the command line parameters

Memory Requirements

Warning: yarGen pulls the whole goodstring database to memory and uses at least 3 GB of memory for a few seconds - 6 GB if opcodes evaluation is activated (--opcodes).

I've already tried to migrate the database to sqlite but the numerous string comparisons and lookups made the analysis painfully slow.

Post-Processing Video Tutorial

YARA rule post-processing video tutorial

Multiple Database Support

yarGen allows creating multiple databases for opcodes or strings. You can easily create a new database by using "-c" for new database creation and "-i identifier" to give the new database a unique identifier, e.g. "office". It will then create two new database files named "good-strings-office.db" and "good-opcodes-office.db" that will from then on be initialized during startup with the built-in databases.

Database Creation / Update Example

Create a new strings and opcodes database from an Office 2013 program directory:

yarGen.py -c --opcodes -i office -g /opt/packs/office2013

The analysis and string extraction process will create the following new databases in the "./dbs" sub folder.

good-strings-office.db
good-opcodes-office.db

The values from these new databases will be automatically applied during the rule creation process because all *.db files in the sub folder "./dbs" will be initialized during startup.
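The startup behavior can be pictured with the following sketch. It assumes that each .db file is a gzip-compressed JSON object mapping a string to its occurrence count; this file format, the function name and the demo file contents are assumptions for illustration only.

```python
import glob
import gzip
import json
import os
import tempfile
from collections import Counter


def load_databases(db_dir, prefix):
    """Merge the counts of all database files in db_dir that start with the prefix."""
    merged = Counter()
    for path in sorted(glob.glob(os.path.join(db_dir, prefix + "*.db"))):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            merged.update(json.load(fh))  # adds up counts for identical strings
    return merged


# Demonstration with two tiny databases in a temporary ./dbs stand-in
with tempfile.TemporaryDirectory() as dbs:
    demo = {
        "good-strings-part1.db": {"kernel32.dll": 3},
        "good-strings-office.db": {"kernel32.dll": 2, "WINWORD.EXE": 1},
    }
    for file_name, content in demo.items():
        with gzip.open(os.path.join(dbs, file_name), "wt", encoding="utf-8") as fh:
            json.dump(content, fh)

    good_strings = load_databases(dbs, "good-strings")

print(good_strings)
```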

You can update previously created databases with the "-u" parameter:

yarGen.py -u --opcodes -i office -g /opt/packs/office365

This would update the "office" databases with new strings extracted from files in the given directory.

Command Line Parameters

usage: yarGen.py [-h] [-m M] [-y min-size] [-z min-score] [-x high-scoring]
                 [-w superrule-overlap] [-s max-size] [-rc maxstrings]
                 [--excludegood] [-o output_rule_file] [-e output_dir_strings]
                 [-a author] [-r ref] [-l lic] [-p prefix] [-b identifier]
                 [--score] [--strings] [--nosimple] [--nomagic] [--nofilesize]
                 [-fm FM] [--globalrule] [--nosuper] [--update] [-g G] [-u]
                 [-c] [-i I] [--dropzone] [--nr] [--oe] [-fs size-in-MB]
                 [--noextras] [--debug] [--trace] [--opcodes] [-n opcode-num]

yarGen

optional arguments:
  -h, --help            show this help message and exit

Rule Creation:
  -m M                  Path to scan for malware
  -y min-size           Minimum string length to consider (default=8)
  -z min-score          Minimum score to consider (default=0)
  -x high-scoring       Score required to set string as 'highly specific
                        string' (default: 30)
  -w superrule-overlap  Minimum number of strings that overlap to create a
                        super rule (default: 5)
  -s max-size           Maximum length to consider (default=128)
  -rc maxstrings        Maximum number of strings per rule (default=20,
                        intelligent filtering will be applied)
  --excludegood         Force the exclude all goodware strings

Rule Output:
  -o output_rule_file   Output rule file
  -e output_dir_strings
                        Output directory for string exports
  -a author             Author Name
  -r ref                Reference (can be string or text file)
  -l lic                License
  -p prefix             Prefix for the rule description
  -b identifier         Text file from which the identifier is read (default:
                        last folder name in the full path, e.g. "myRAT" if -m
                        points to /mnt/mal/myRAT)
  --score               Show the string scores as comments in the rules
  --strings             Show the string scores as comments in the rules
  --nosimple            Skip simple rule creation for files included in super
                        rules
  --nomagic             Don't include the magic header condition statement
  --nofilesize          Don't include the filesize condition statement
  -fm FM                Multiplier for the maximum 'filesize' condition value
                        (default: 3)
  --globalrule          Create global rules (improved rule set speed)
  --nosuper             Don't try to create super rules that match against
                        various files

Database Operations:
  --update              Update the local strings and opcodes dbs from the
                        online repository
  -g G                  Path to scan for goodware (dont use the database
                        shipped with yaraGen)
  -u                    Update local standard goodware database with a new
                        analysis result (used with -g)
  -c                    Create new local goodware database (use with -g and
                        optionally -i "identifier")
  -i I                  Specify an identifier for the newly created databases
                        (good-strings-identifier.db, good-opcodes-
                        identifier.db)

General Options:
  --dropzone            Dropzone mode - monitors a directory [-m] for new
                        samples to processWARNING: Processed files will be
                        deleted!
  --nr                  Do not recursively scan directories
  --oe                  Only scan executable extensions EXE, DLL, ASP, JSP,
                        PHP, BIN, INFECTED
  -fs size-in-MB        Max file size in MB to analyze (default=10)
  --noextras            Don't use extras like Imphash or PE header specifics
  --debug               Debug output
  --trace               Trace output

Other Features:
  --opcodes             Do use the OpCode feature (use this if not enough high
                        scoring strings can be found)
  -n opcode-num         Number of opcodes to add if not enough high scoring
                        string could be found (default=3)

Best Practice

See the following blog posts for a more detailed description on how to use yarGen for YARA rule creation:

How to Write Simple but Sound Yara Rules - Part 1

How to Write Simple but Sound Yara Rules - Part 2

How to Write Simple but Sound Yara Rules - Part 3

Screenshots

Generator Run

Output Rule

As you can see in the screenshot above you'll get a rule that contains strings, which are not found in the goodware strings database.

You should clean up the rules afterwards. In the example above, remove the strings $s14, $s17, $s19, $s20 that look like random code to get a cleaner rule that is more likely to match on other samples of the same family.

To get a more generic rule, remove string $s5, which is very specific for this compiled executable.

Examples

Dropzone Mode (Recommended)

Monitors a given folder (-m) for new samples, processes the samples, writes YARA rules to the set output file (default: yargen_rules.yar) and deletes the folder contents afterwards.

python yarGen.py -a "yarGen Dropzone" --dropzone -m /opt/mal/dropzone

WARNING: All files dropped to the set dropzone will be removed!

In the following example two files named identifier.txt and reference.txt are read and used for the reference and as identifier in the YARA rule sets. The files are read at each iteration and not only during initialization. This way you can pass specific strings to each dropzone rule generation.

python yarGen.py --dropzone -m /opt/mal/dropzone -b /opt/mal/dropzone/identifier.txt -r /opt/mal/dropzone/reference.txt

Use the shipped database (FAST) to create some rules

python yarGen.py -m X:\MAL\Case1401

Use the shipped database of goodware strings and scan the malware directory "X:\MAL" recursively. Create rules for all files included in this directory and below. A file named 'yargen_rules.yar' will be generated in the current directory.

Show the score of the strings as comment

yarGen will by default use the top 20 strings based on their score. To see how a certain string in the rule scored, use the "--score" parameter.

python yarGen.py --score -m X:\MAL\Case1401

Use only strings with a certain minimum score

To use only strings that reach a certain minimum score in your rules, use the "-z" parameter. It is a good practice to first create rules with "--score" and then perform a second run with a minimum score set for your sample set via "-z".

python yarGen.py --score -z 5 -m X:\MAL\Case1401
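The combined effect of the string limit and the minimum score can be sketched as follows. The function and the scores are made up, and the additional "intelligent filtering" mentioned in the help text is not modeled.

```python
def select_strings(scored, max_strings=20, min_score=0):
    """Return the best-scoring strings that reach min_score, highest score first."""
    eligible = [(s, score) for s, score in scored.items() if score >= min_score]
    eligible.sort(key=lambda item: (-item[1], item[0]))
    return eligible[:max_strings]


# Hypothetical scores as they would appear in the --score comments
scored = {"evil_mutex_name": 12, "cmd.exe /c": 7, "Mozilla/5.0": 2, "GetProcAddress": -40}

top = select_strings(scored, max_strings=20, min_score=5)
print(top)
```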

Preset author and reference

python yarGen.py -a "Florian Roth" -r "http://goo.gl/c2qgFx" -m /opt/mal/case_441 -o case441.yar

Add opcodes to the rules

python yarGen.py --opcodes -a "Florian Roth" -r "http://goo.gl/c2qgFx" -m /opt/mal/case33 -o rules33.yar

Show debugging output

python yarGen.py --debug -m /opt/mal/case_441

Create a new goodware strings database

python yarGen.py -c --opcodes -g /home/user/Downloads/office2013 -i office

This will generate two new databases for strings and opcodes named:

  • good-strings-office.db
  • good-opcodes-office.db

The new databases will automatically be initialized during startup and are from then on used for rule generation.

Update a goodware strings database (append new strings, opcodes, imphashes, exports to the old ones)

python yarGen.py -u -g /home/user/Downloads/office365 -i office

My Best Practice Command Line

python yarGen.py -a "Florian Roth" -r "Internal Research" -m /opt/mal/apt_case_32

db-lookup.py

A tool named db-lookup.py, introduced with version 0.18.0, allows you to query the local databases through a simple command line interface. The interface takes an input value, which can be a string, export or imphash value, detects the query type and then performs a lookup in the loaded databases. This lets you check whether a value appears in the goodware that was processed to generate the databases.

This is a nice feature that helps you to answer the following questions:

  • Does this string appear in goodware samples of my database?
  • Does this export name appear in goodware samples of my database?
  • Does a sample in my goodware database have this imphash?

However, there are several drawbacks:

  • It only matches on the full string (no contains, no startswith, no endswith)
  • Opcode lookup is not supported (yet)
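The query type detection and the exact-match lookup can be pictured with a small sketch. It assumes that an imphash is recognized as a 32-character hex value (an MD5 digest) and that everything else is looked up as both string and export; the function and the database contents are hypothetical and not taken from db-lookup.py.

```python
import re

IMPHASH_RE = re.compile(r"^[0-9a-fA-F]{32}$")


def lookup(value, strings_db, exports_db, imphash_db):
    """Detect the query type and return the goodware occurrence counts."""
    if IMPHASH_RE.match(value):
        return {"imphash": imphash_db.get(value.lower(), 0)}
    # Exact key match only: no contains, startswith or endswith
    return {"string": strings_db.get(value, 0), "export": exports_db.get(value, 0)}


# Hypothetical database contents
strings_db = {"CreateFileW": 5200}
exports_db = {"DllRegisterServer": 310}
imphash_db = {"0123456789abcdef0123456789abcdef": 4}

print(lookup("CreateFileW", strings_db, exports_db, imphash_db))
print(lookup("0123456789ABCDEF0123456789ABCDEF", strings_db, exports_db, imphash_db))
print(lookup("CreateFile", strings_db, exports_db, imphash_db))
```

The last query returns zero hits although "CreateFileW" is in the database, which shows the full-string limitation listed above.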

I plan to release a new project named Valknut which extracts overlapping byte sequences from samples and creates searchable databases. This project will be the new backend API for yarGen allowing all kinds of queries, opcodes and string values, ascii and wide formatted.

yargen's People

Contributors

crimsonglory, gitter-badger, iam-py-test, neo23x0, ruppde, seekamoon

yargen's Issues

Escaping really needed?

is this really needed in yarGen.py? strings with double quotes end up with 5 backslashes when creating new json files with "-g -c", e.g.

             "\\\\\"isexe@": 1,

probably they get escaped a 2nd time by the json-export function?

            # Escape strings
            if len(string) > 0:
                string = string.replace(b'\\', b'\\\\')
                string = string.replace(b'"', b'\\"')

It looks to me like it works without these 3 lines (which results in only one backslash before a double quote), but I'm new to yarGen.
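The suspected double escaping can be reproduced in isolation. The sketch below works on str instead of bytes, and the input string is chosen for illustration; it is not necessarily the string from the database, so the number of backslashes differs from the excerpt above.

```python
import json

raw = '"isexe@'  # a string that starts with a double quote

# json.dumps already escapes the quote on its own
plain = json.dumps({raw: 1})

# Manual escaping first, as in the quoted lines, then JSON export on top
escaped = raw.replace('\\', '\\\\').replace('"', '\\"')
double = json.dumps({escaped: 1})

print(plain)   # {"\"isexe@": 1}
print(double)  # {"\\\"isexe@": 1}

# Only the version without manual escaping round-trips to the original string
print(list(json.loads(plain)) == [raw])
print(list(json.loads(double)) == [raw])
```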

ModuleNotFoundError on Update

Using a venv from Python 3.7.2 and the following installed via pip:
Successfully installed future-0.17.1 lxml-4.2.5 naiveBayesClassifier-0.1.3 pefile-2018.8.8 scandir-1.9.0

$ python yarGen.py --update
Traceback (most recent call last):
  File "yarGen.py", line 26, in <module>
    from naiveBayesClassifier.trainer import Trainer
  File "/Users/user/Downloads/samples/generate/lib/python3.7/site-packages/naiveBayesClassifier/trainer.py", line 1, in <module>
    from naiveBayesClassifier.trainedData import TrainedData
  File "/Users/user/Downloads/samples/generate/lib/python3.7/site-packages/naiveBayesClassifier/trainedData.py", line 1, in <module>
    from ExceptionNotSeen import NotSeen
ModuleNotFoundError: No module named 'ExceptionNotSeen'

Issue on Linux directories

The error message after running is about args.m < 2.

It looks like the app was written to run on Windows and is not compatible with os.path on Linux.

Import scandir error

After pulling down yarGen and trying to update, the following error occurs:

Traceback (most recent call last):
  File "yarGen.py", line 21, in <module>
    import scandir
ModuleNotFoundError: No module named 'scandir'

However, the dependency already exists:

Requirement already satisfied: scandir in /usr/local/lib/python3.8/site-packages (from -r requirements.txt (line 1)) (1.10.0)
Requirement already satisfied: pefile in /usr/local/lib/python3.8/site-packages (from -r requirements.txt (line 2)) (2019.4.18)
Requirement already satisfied: lxml in /usr/local/lib/python3.8/site-packages (from -r requirements.txt (line 3)) (4.6.1)
Requirement already satisfied: future in /usr/local/lib/python3.8/site-packages (from pefile->-r requirements.txt (line 2)) (0.18.2)

Unsure how to proceed...

macOS Catalina
Version 10.15.7
MacBook Pro (16-inch, 2019)
Processor 2.6 GHz 6-Core Intel Core i7
Memory 16 GB 2667 MHz DDR4
Graphics AMD Radeon Pro 5300M 4 GB
Intel UHD Graphics 630 1536 MB

Problem when installing dependencies

Hi

I just want to point out one thing I came across when installing:
Use of a bare "pip install" is getting deprecated. Since yarGen is still on Python 2, you may run into issues when your main installation is Python 3 and you install via "pip" or even "pip2" install -r requirements, or install the packages mentioned in the README directly.

So, the solution in my case was to replace "pip install scandir lxml naiveBayesClassifier pefile" with the explicit "python2.7 -m pip install scandir lxml naiveBayesClassifier pefile".

18.04.1 Ubuntu

Thanks!

Gzip memory error

I get a memory error during the update command.
'python yarGen.py -update'

Here is my memory state:
Total Memory : 15.8GB
Used Memory : 8.0GB

Here is the error message:
[+] Loading ./dbs/good-strings-part5.db ...
Traceback (most recent call last):
  File "yarGen.py", line 1895, in <module>
    good_pickle = load(get_abs_path(filePath))
  File "yarGen.py", line 1598, in load
    data = file.read()
  File "C:\Python27\lib\gzip.py", line 261, in read
    self._read(readsize)
  File "C:\Python27\lib\gzip.py", line 320, in _read
    self._add_read_data( uncompress )
  File "C:\Python27\lib\gzip.py", line 338, in _add_read_data
    self.extrabuf = self.extrabuf[offset:] + data
MemoryError

opcode extraction fails

When the .text section is not named .text (for example, when section names are randomized), the function extract_opcodes fails.
Please add the following to the function.

```python
import binascii
import re
import traceback

import pefile


def extract_opcodes(filePath):
    # Opcode list
    opcodes = []

    # Read file data
    try:
        print("[-] Extracting OpCodes: %s" % filePath)

        pe = pefile.PE(filePath)
        ep = pe.OPTIONAL_HEADER.AddressOfEntryPoint

        # Find the section that contains the entry point, whatever its name is
        name = b""
        for sec in pe.sections:
            if sec.VirtualAddress <= ep < sec.VirtualAddress + sec.Misc_VirtualSize:
                name = sec.Name.rstrip(b"\x00")
                break

        for section in pe.sections:
            if section.Name.rstrip(b"\x00") == name:
                text = section.get_data()
                # Split text into subs
                text_parts = re.split(b"[\x00]{3,}", text)
                # Now truncate and encode opcodes
                for text_part in text_parts:
                    if len(text_part) < 8:
                        continue
                    opcodes.append(binascii.hexlify(text_part[:16]).decode("ascii"))

    except Exception:
        # `args` is the global argparse namespace defined in yarGen.py
        if args.debug:
            traceback.print_exc()

    return opcodes
```

[question] [E] Missing goodware imphash/export databases. Error question

I tried to create the yargen_rules.yar rule, but the following error message occurred.
All db files under the ./dbs/ directory exist normally, but an error occurs during execution. May I inquire about any problems?

========================================================================================

[...]# /root/hacking/yarGen-0.21.2/yarGen.py -m /root/hacking/hackfile/
###############################################################################
[yarGen ASCII art banner]

Yara Rule Generator by Florian Roth
December 2018
Version 0.21.1

###############################################################################
[+] Using identifier 'hacking'
[+] Using reference 'https://github.com/Neo23x0/yarGen'
[+] Using prefix 'hacking'
[+] Processing PEStudio strings ...
[+] Reading goodware strings from database 'good-strings.db' ...
(This could take some time and uses at least 3 GB of RAM)
[+] Loading ./dbs/good-exports-part1.db ...
Traceback (most recent call last):
  File "/root/hacking/yarGen-0.21.2/yarGen.py", line 2407, in <module>
    good_exports_pickle = load(get_abs_path(filePath))
  File "/root/hacking/yarGen-0.21.2/yarGen.py", line 1912, in load
    object = pickle.loads(buffer)
  File "/usr/lib64/python2.7/pickle.py", line 1382, in loads
    return Unpickler(file).load()
  File "/usr/lib64/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
KeyError: '{'
[+] Loading ./dbs/good-imphashes-part8.db ...
Traceback (most recent call last):
  File "/root/hacking/yarGen-0.21.2/yarGen.py", line 2396, in <module>
    good_imphashes_pickle = load(get_abs_path(filePath))
  File "/root/hacking/yarGen-0.21.2/yarGen.py", line 1912, in load
    object = pickle.loads(buffer)
  File "/usr/lib64/python2.7/pickle.py", line 1382, in loads
    return Unpickler(file).load()
  File "/usr/lib64/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
KeyError: '{'
[+] Loading ./dbs/good-imphashes-part3.db ...
Traceback (most recent call last):
  File "/root/hacking/yarGen-0.21.2/yarGen.py", line 2396, in <module>
    good_imphashes_pickle = load(get_abs_path(filePath))
  File "/root/hacking/yarGen-0.21.2/yarGen.py", line 1912, in load
    object = pickle.loads(buffer)
  File "/usr/lib64/python2.7/pickle.py", line 1382, in loads
    return Unpickler(file).load()
  File "/usr/lib64/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
KeyError: '{'
[+] Loading ./dbs/good-imphashes-part4.db ...
Traceback (most recent call last):
  File "/root/hacking/yarGen-0.21.2/yarGen.py", line 2396, in <module>
    good_imphashes_pickle = load(get_abs_path(filePath))
  File "/root/hacking/yarGen-0.21.2/yarGen.py", line 1912, in load
    object = pickle.loads(buffer)
  File "/usr/lib64/python2.7/pickle.py", line 1382, in loads
    return Unpickler(file).load()
  File "/usr/lib64/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
KeyError: '{'
...
...
[+] Loading ./dbs/good-exports-part9.db ...
Traceback (most recent call last):
  File "yarGen.py_0.21.1", line 2407, in <module>
    good_exports_pickle = load(get_abs_path(filePath))
  File "yarGen.py_0.21.1", line 1912, in load
    object = pickle.loads(buffer)
  File "/usr/lib64/python2.7/pickle.py", line 1382, in loads
    return Unpickler(file).load()
  File "/usr/lib64/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
KeyError: '{'
[E] Missing goodware imphash/export databases. Please run 'yarGen.py --update' to retrieve the newest database set.
[E] Error - no goodware databases found. Please run 'yarGen.py --update' to retrieve the newest database set.

[... yarGen-0.21.2]# ll ./dbs/good-imphashes-part6.db
-rw-r--r--. 1 root root 715 Dec 15 01:15 ./dbs/good-imphashes-part6.db
[... yarGen-0.21.2]# ll ./dbs/good-exports-part9.db
-rw-r--r--. 1 root root 44 Dec 15 01:08 ./dbs/good-exports-part9.db

[... dbs]# pwd
/root/hacking/yarGen-0.21.2/dbs

[... dbs]# ll
total 785860
-rw-r--r--. 1 root root 1555146 Dec 15 00:35 good-exports-part1.db
-rw-r--r--. 1 root root 1238335 Dec 15 00:35 good-exports-part2.db
-rw-r--r--. 1 root root 1656390 Dec 15 00:35 good-exports-part3.db
-rw-r--r--. 1 root root 1234219 Dec 15 01:23 good-exports-part4.db
-rw-r--r--. 1 root root 2570093 Dec 15 01:16 good-exports-part5.db
-rw-r--r--. 1 root root 74998 Dec 15 00:55 good-exports-part6.db
-rw-r--r--. 1 root root 1441009 Dec 15 00:37 good-exports-part7.db
-rw-r--r--. 1 root root 151144 Dec 15 00:35 good-exports-part8.db
-rw-r--r--. 1 root root 44 Dec 15 01:08 good-exports-part9.db
-rw-r--r--. 1 root root 32275 Dec 15 00:35 good-imphashes-part1.db
-rw-r--r--. 1 root root 21598 Dec 15 01:29 good-imphashes-part2.db
-rw-r--r--. 1 root root 81395 Dec 15 01:16 good-imphashes-part3.db
-rw-r--r--. 1 root root 51426 Dec 15 01:08 good-imphashes-part4.db
-rw-r--r--. 1 root root 148364 Dec 15 00:35 good-imphashes-part5.db
-rw-r--r--. 1 root root 715 Dec 15 01:15 good-imphashes-part6.db
-rw-r--r--. 1 root root 72273 Dec 15 01:08 good-imphashes-part7.db
-rw-r--r--. 1 root root 3906 Dec 15 00:35 good-imphashes-part8.db
-rw-r--r--. 1 root root 52 Dec 15 01:01 good-imphashes-part9.db
-rw-r--r--. 1 root root 40036608 Dec 15 01:23 good-opcodes-part1.db
-rw-r--r--. 1 root root 37660735 Dec 15 00:39 good-opcodes-part2.db
-rw-r--r--. 1 root root 86709267 Dec 15 01:08 good-opcodes-part3.db
-rw-r--r--. 1 root root 36161350 Dec 15 01:20 good-opcodes-part4.db
-rw-r--r--. 1 root root 236825132 Dec 15 00:55 good-opcodes-part5.db
-rw-r--r--. 1 root root 19250880 Dec 15 00:37 good-opcodes-part6.db
-rw-r--r--. 1 root root 75104057 Dec 15 01:15 good-opcodes-part7.db
-rw-r--r--. 1 root root 7902850 Dec 15 01:02 good-opcodes-part8.db
-rw-r--r--. 1 root root 44 Dec 15 01:29 good-opcodes-part9.db
-rw-r--r--. 1 root root 24735115 Dec 15 00:57 good-strings-part1.db
-rw-r--r--. 1 root root 27676540 Dec 15 01:25 good-strings-part2.db
-rw-r--r--. 1 root root 57603741 Dec 15 01:29 good-strings-part3.db
-rw-r--r--. 1 root root 39814980 Dec 15 01:11 good-strings-part4.db
-rw-r--r--. 1 root root 67488946 Dec 15 01:01 good-strings-part5.db
-rw-r--r--. 1 root root 7909572 Dec 15 01:02 good-strings-part6.db
-rw-r--r--. 1 root root 24281798 Dec 15 00:35 good-strings-part7.db
-rw-r--r--. 1 root root 5137210 Dec 15 00:36 good-strings-part8.db
-rw-r--r--. 1 root root 14745 Dec 15 01:15 good-strings-part9.db

rules coverage

Hello,
do you have in mind something able to compare the different generated rules in order to show the overall overlap?

Cannot create databases (0.23.3 release has 0.23.2 code)

The commit that fixes #27 increments the version number to 0.23.3, but the 0.23.3 release itself still has the code from 0.23.2. This makes creating a database not possible.

Expected Behavior

Successful build of a local goodware database.

Current Behavior

Script fails with "memoryview: a bytes-like object is required, not 'str'"

Possible Solution

Re-upload of release 0.23.3 containing the code base immediately after commit f582

Steps to Reproduce

wget https://github.com/Neo23x0/yarGen/archive/0.23.3.zip
unzip -d yarGen 0.23.3.zip
cd yarGen
cd yarGen-0.23.3
python3 ./yarGen.py --update #can also do "mkdir dbs" instead to save time
mkdir sample
echo "int main(){return 0;}" > test.c
gcc -o ./sample/test.elf ./test.c
python3 ./yarGen.py -g ./sample -c -i sample --opcodes

Context (Environment)

Ubuntu 20.04 LTS

possible problem when no permission to read files

@Neo23x0 Hi, think i have found some real issue(s) this time.

  1. Tried to create a super rule for 15 similar executables (at least their strings are similar) and it created only 6.
    I looked at the log and saw 9 lines like this:

[+] Processing /home/remnux/Desktop/winexec/cafb416560f61a7812917638fd6b0657403c10a5bfdcb8d9d2e26db89b3040e0.bin ...
[-] Skipping strings/opcodes from /home/remnux/Desktop/winexec/cafb416560f61a7812917638fd6b0657403c10a5bfdcb8d9d2e26db89b3040e0.bin due to MD5 duplicate detection
[+] Processing /home/remnux/Desktop/winexec/4df547b75a382539fa41c711d6f4b11f144267f9715ae3e0dda43e0b51623326.bin ...
[-] Skipping strings/opcodes from /home/remnux/Desktop/winexec/4df547b75a382539fa41c711d6f4b11f144267f9715ae3e0dda43e0b51623326.bin due to MD5 duplicate detection

i noticed the file date is funny on those 9 files that were skipped, and since they were extracted from zip, i decided to try to change their permissions.
ran again, and it worked:

[+] Processing /home/remnux/Desktop/winexec/cafb416560f61a7812917638fd6b0657403c10a5bfdcb8d9d2e26db89b3040e0.bin ...
[-] Extracting Strings: /home/remnux/Desktop/winexec/cafb416560f61a7812917638fd6b0657403c10a5bfdcb8d9d2e26db89b3040e0.bin
148 ASCII strings extracted
1 ASCII strings extracted
[+] Processed /home/remnux/Desktop/winexec/cafb416560f61a7812917638fd6b0657403c10a5bfdcb8d9d2e26db89b3040e0.bin Size: 19456 Strings: 222 OpCodes: 0 ...

so it looks like it skips the files for the wrong reason.

  2. I'm not good at math, but the numbers in the output above don't seem to match. It could be a wrong print, an issue, or just not enough detailed debug info (why the 2nd print of 1 ASCII string?)

  3. I'm quite sure this is intended, but my output didn't have a super rule, although all 15 files
    had a full match on 1 particular substring they all shared, although the full strings varied.
    for example:

file1:
$s1 = aabrakadabra
$s2 = aaa

file2:
$s1 = babrakadabra
$s2 = bbb

file3:
$s1 = cabrakadabra
$s2 = ccc

so the full strings dont match at all.

but a super rule with:

$s1 = abrakadabra

would catch them all.

again, not sure if this intended or not, if not, might be a very nice improvement.
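The suggestion can be illustrated with a brute-force sketch that finds the longest substring shared by all given strings. It is only a demonstration of the idea with a hypothetical function name, not existing yarGen functionality, and it does not scale to large string sets.

```python
def longest_common_substring(strings):
    """Return the longest substring contained in every string (empty if none)."""
    if not strings:
        return ""
    shortest = min(strings, key=len)
    # Try the longest candidates first and stop at the first one found in all strings
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""


first_strings = ["aabrakadabra", "babrakadabra", "cabrakadabra"]
print(longest_common_substring(first_strings))
```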

Error in getting update for dbs

###############################################################################
Downloading good-strings-part7.db from https://www.bsk-consulting.de/yargen/good-strings-part7.db ...
Error while downloading the database file - check your Internet connection
Alterntive download link: https://drive.google.com/drive/folders/0B2S_IOa0MiOHS0xmekR6VWRhZ28
Download the files and place them into the ./dbs/ folder

The alternative download link for the drive has no content. It says: "This folder is in the owner's trash.
To view this folder, ask the owner (Florian) to restore it."

pip issues

[Linux][Kali]

TL;DR: Fix

If you get
AttributeError: 'module' object has no attribute 'PY2'

just try updating pip:
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py
rm get-pip.py
sudo pip install scandir lxml naiveBayesClassifier pefile

Error

pip raises an exception for scandir, lxml and naiveBayesClassifier:
pip install scandir
Downloading/unpacking scandir
Downloading scandir-1.2.tar.gz
Cleaning up...
Exception:
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 122, in main
[...]
File "/usr/lib/python2.7/dist-packages/pip/req.py", line 285, in setup_py
if six.PY2 and isinstance(setup_py, six.text_type):
AttributeError: 'module' object has no attribute 'PY2'

Fixes that did not work

Note that, to avoid errors, you may want to use:
sudo pip install pefile
Then run:
sudo pip3 install scandir lxml naiveBayesClassifier

But this raises errors if you run yarGen:

with python2
ImportError: No module named scandir

with python3
except Exception, e:
SyntaxError: invalid syntax

I searched for a conflict between pip and pip3 and ran apt-get update && apt-get upgrade, but the issue was still there.
So I checked the official pip website and compared my pip version, which wasn't up to date.
The upgrade did the trick.
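
The python3 error quoted above is what Python 3 interpreters of that time report for the Python 2-only `except Exception, e:` form. As a minimal illustration (the function `parse_port` is hypothetical and not taken from yarGen), the `as` spelling below is accepted by both Python 2.6+ and Python 3:

```python
def parse_port(text):
    """Return text as an int port, or None if it is not a valid number."""
    try:
        return int(text)
    except ValueError as error:  # 'as' works on Python 2.6+ and Python 3
        print("could not parse %r: %s" % (text, error))
        return None
```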

Overall rules for Malware Families?

I noticed when I scanned 14 exe's all of the same malware family it outputted a rule for each EXE and none of the detections were that similar. Is there a way to create an overall rule based on the matching opcodes/strings between the large amount of exes so the full malware family is detected instead of having specific ones for each exe?

This may just be my ignorance on using the script.
It seems inefficient to not do a single rule for the malware family.
Both of these EXEs belong to the Raccoon Stealer family.

Example from two of the generated rules:

rule sig_1f9bd27fd7591a98afd67499ae6730eb56c137335d283892bc06b7ab2241ed6c {
   meta:
      description = "malware - file 1f9bd27fd7591a98afd67499ae6730eb56c137335d283892bc06b7ab2241ed6c.exe"
      author = "Babyhamsta"
      reference = "https://github.com/Neo23x0/yarGen"
      date = "2023-12-11"
      hash1 = "1f9bd27fd7591a98afd67499ae6730eb56c137335d283892bc06b7ab2241ed6c"
   strings:
      $s1 = "HGDI32.dll" fullword ascii
      $s2 = "ACDSeeQVUltimate15.exe.dll" fullword wide
      $s3 = "         <requestedExecutionLevel level='asInvoker' uiAccess='false'/>" fullword ascii
      $s4 = "nhjnjK:\"V" fullword ascii
      $s5 = "\\QHdll_." fullword ascii
      $s6 = "* }5Q^" fullword ascii
      $s7 = "* _m3`Q" fullword ascii
      $s8 = "TaCe+ l" fullword ascii
      $s9 = "wqC.Qot" fullword ascii
      $s10 = "1Wd.BEy" fullword ascii
      $s11 = "F:\"0V1" fullword ascii
      $s12 = "7Xpq.ssc" fullword ascii
      $s13 = "6aF.OCB}cR" fullword ascii
      $s14 = "rct3s:\\" fullword ascii
      $s15 = "/L:\"l?'Y" fullword ascii
      $s16 = " R:\"*b" fullword ascii
      $s17 = "zo:\"jhhs`" fullword ascii
      $s18 = "]SE:\\e" fullword ascii
      $s19 = "AAT:\\," fullword ascii
      $s20 = "D:\\ X{" fullword ascii

      $op0 = { 8b 34 b5 20 4e f6 00 66 3b fb f8 66 85 ed c1 e8 }
      $op1 = { 66 c1 e2 f5 c1 ca 02 8d b6 ff ff ff ff f7 c7 d7 }
      $op2 = { 89 01 8d bf fc ff ff ff 66 8b d7 c0 da 5a 8b 17 }
   condition:
      uint16(0) == 0x5a4d and
      ( 8 of them and all of ($op*) )
}

rule sig_6b7bb7ed7e486cdc4e3a1d67a598aeee5a74e3c58f94e48e5fa626d6562f8688 {
   meta:
      description = "malware - file 6b7bb7ed7e486cdc4e3a1d67a598aeee5a74e3c58f94e48e5fa626d6562f8688.exe"
      author = "Babyhamsta"
      reference = "https://github.com/Neo23x0/yarGen"
      date = "2023-12-11"
      hash1 = "6b7bb7ed7e486cdc4e3a1d67a598aeee5a74e3c58f94e48e5fa626d6562f8688"
   strings:
      $s1 = "BladesGray.exe" fullword wide
      $s2 = "gunshot.exe" fullword wide
      $s3 = "Kozatipepikici laci. Canogoz pupaho. Jil xofiroj xokur xisidukuy. Sesecigo bipaxuh nuvu. Roladig. Gayoyir mil. Daxafoxa mik. Lez" ascii
      $s4 = "22222222222222222222222226" ascii /* hex encoded string '""""""""""""&' */
      $s5 = "222222222222222222222C" ascii /* hex encoded string '"""""""""",' */
      $s6 = "2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222" ascii /* hex encoded string '""""""""""""""""""""""""""""""""""""""""""""' */
      $s7 = "22222222222222222222222222222222222222" ascii /* hex encoded string '"""""""""""""""""""' */
      $s8 = "222222222C" ascii /* hex encoded string '"""",' */
      $s9 = "222222222222222222222222222222222222222222" ascii /* hex encoded string '"""""""""""""""""""""' */
      $s10 = "22222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222" ascii /* hex encoded string '""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""&' */
      $s11 = "2222222222222222222222222222222222222222222222" ascii /* hex encoded string '"""""""""""""""""""""""' */
      $s12 = "62222222222222222222222222" ascii /* hex encoded string 'b""""""""""""' */
      $s13 = "222222222222222222222222222222222222" ascii /* hex encoded string '""""""""""""""""""' */
      $s14 = "22222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222" ascii /* hex encoded string '""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""' */
      $s15 = "22222222222222222222222222222222222222222222222222222222222222222222222226" ascii /* hex encoded string '""""""""""""""""""""""""""""""""""""&' */
      $s16 = "6222222222222222222222" ascii /* hex encoded string 'b""""""""""' */
      $s17 = "4.1.61.50" fullword wide /* hex encoded string 'AaP' */
      $s18 = "ijeye. Minucivuxiyupor kafihizeyokocu piyit. Temezoto zebenuxokeyosop. Patikinehunalo hej fexu. Piv zegosob deti jovisodidicoyam" ascii
      $s19 = "zeposureyutazajimevusayunu xetoneya beyeziyumosalasagasilumepiz cevikotevokoyufajozebupoge" fullword wide
      $s20 = "FileDescriptions" fullword wide

      $op0 = { 83 ff ff ff db ff ff ff 35 }
      $op1 = { eb 04 83 65 e0 00 8b 45 e0 e8 96 34 00 00 c3 83 }
      $op2 = { 33 c0 8b 4d fc 5f 5e 33 cd 5b e8 34 3c 00 00 c9 }
   condition:
      uint16(0) == 0x5a4d and
      ( 8 of them and all of ($op*) )
}

Unable to create/update goodware db on windows (SSL verify fail)

Expected Behavior

Successful build of local goodware database.

Current Behavior

Script fails with error SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)
It's caused by the rootCA management mechanism ( https://bugs.python.org/issue36011 )

Solution

Install certifi (Mozilla's root CA bundle, https://github.com/certifi/python-certifi) with pip (python -m pip install certifi), or include certifi in the requirements.txt file.
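
For code that builds its own HTTPS requests, the certifi bundle can also be passed in explicitly. The sketch below assumes the download is done with `urllib.request`; `build_ssl_context` and `download` are hypothetical helpers, not functions from yarGen, and the code falls back to the system store when certifi is not installed.

```python
import ssl
import urllib.request


def build_ssl_context():
    """Prefer certifi's CA bundle; fall back to the system default store."""
    try:
        import certifi  # third-party; assumed installed via pip
    except ImportError:
        return ssl.create_default_context()
    return ssl.create_default_context(cafile=certifi.where())


def download(url, destination, timeout=30):
    """Fetch url over HTTPS with the context above and save it to destination."""
    context = build_ssl_context()
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        with open(destination, "wb") as handle:
            handle.write(response.read())
```

Certificate verification stays enabled in both branches; only the source of trusted root certificates changes.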

Context (Environment)

Windows 10

pip3 command misleading

In your README, it says to try using pip3 to install dependencies:

sudo pip3 install scandir lxml naiveBayesClassifier

This is misleading. Your yarGen code does not support python3. That line should be removed, and the README should make it clear that python3 is not supported.

TypeError when creating new goodware strings database

I tried to create a new goodware strings database from a directory containing Mach-O samples:

[+] Creating local database ...
[+] Using './dbs/good-strings-mac_os.db' as filename for newly created strings database
[+] Using './dbs/good-opcodes-mac_os.db' as filename for newly created opcodes database
[+] Using './dbs/good-imphashes-mac_os.db' as filename for newly created opcodes database
[+] Using './dbs/good-exports-mac_os.db' as filename for newly created opcodes database
Traceback (most recent call last):
  File "yarGen.py", line 2288, in <module>
    save(good_json, strings_db)
  File "yarGen.py", line 1870, in save
    file.write(json.dumps(object))
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/gzip.py", line 276, in write
    data = memoryview(data)
TypeError: memoryview: a bytes-like object is required, not 'str'
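
The traceback points at a `str` being written to a gzip file opened in binary mode. A minimal sketch of one possible fix is to encode the JSON text before writing. It assumes the `save` function in yarGen.py looks roughly like the version below; its real body is not shown in this report, and `load` is a hypothetical counterpart.

```python
import gzip
import json


def save(obj, filename):
    """Serialize obj as JSON and write it gzip-compressed."""
    with gzip.GzipFile(filename, "wb") as handle:
        # json.dumps returns str; GzipFile.write needs bytes on Python 3.
        handle.write(json.dumps(obj).encode("utf-8"))


def load(filename):
    """Read back a file written by save()."""
    with gzip.GzipFile(filename, "rb") as handle:
        return json.loads(handle.read().decode("utf-8"))
```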

cannot save opcode database

Changing the opcode save from
opcodes.append(binascii.hexlify(text_part[:16]))
to
opcodes.append(text_part[:16].hex())
should work, because json cannot dump a dict whose keys are of type bytes.
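
The difference is easy to reproduce in isolation. In this sketch the sample bytes and the use of a `Counter` keyed by opcode are assumptions made for illustration; `can_dump` is a hypothetical helper.

```python
import binascii
import json
from collections import Counter

# 16 sample bytes standing in for the start of a .text section
text_part = bytes.fromhex("8b34b5204ef600663bfbf86685edc1e8")

as_bytes = binascii.hexlify(text_part[:16])  # bytes on Python 3
as_text = text_part[:16].hex()               # str on Python 3


def can_dump(mapping):
    """Report whether json.dumps accepts the given mapping."""
    try:
        json.dumps(mapping)
        return True
    except TypeError:
        return False


print(can_dump(Counter([as_bytes])))  # bytes key is rejected
print(can_dump(Counter([as_text])))   # str key is accepted
```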

0 rules being generated

When I use the latest commit (0311261) 0 simple or super rules are being generated.

$ python yarGen.py -a "Sean Whalen" -m /opt/malware/droppers/upatre -o upatre.yar

                    ______
  __  ______ ______/ ____/__  ____
 / / / / __ `/ ___/ / __/ _ \/ __ \
/ /_/ / /_/ / /  / /_/ /  __/ / / /
\__, /\__,_/_/   \____/\___/_/ /_/
/____/

Yara Rule Generator
by Florian Roth
July 2015
Version 0.15.0 beta

[+] Processing PEStudio strings ...
[+] Reading goodware strings from database 'good-strings.db' and 'good-opcodes.db' ...
(This could take some time and uses up to 4 GB of RAM)
[+] Initializing Bayes Filter ...
[-] Training filter with good strings from ./lib/good.txt
[+] Processing malware files ...
[+] Processing /opt/malware/droppers/upatre/app_details.exe ...
[-] Extracting Strings: /opt/malware/droppers/upatre/app_details.exe
[-] Extracting OpCodes: /opt/malware/droppers/upatre/app_details.exe
[+] Processing /opt/malware/droppers/upatre/application_features.exe ...
[-] Extracting Strings: /opt/malware/droppers/upatre/application_features.exe
[-] Extracting OpCodes: /opt/malware/droppers/upatre/application_features.exe
[-] Skipping strings/opcodes from /opt/malware/droppers/upatre/application_features.exe due to MD5 duplicate detection
[+] Processing /opt/malware/droppers/upatre/application_information.exe ...
[-] Extracting Strings: /opt/malware/droppers/upatre/application_information.exe
[-] Extracting OpCodes: /opt/malware/droppers/upatre/application_information.exe
[-] Skipping strings/opcodes from /opt/malware/droppers/upatre/application_information.exe due to MD5 duplicate detection
[+] Processing /opt/malware/droppers/upatre/cadvahin.exe ...
[-] Extracting Strings: /opt/malware/droppers/upatre/cadvahin.exe
[-] Extracting OpCodes: /opt/malware/droppers/upatre/cadvahin.exe
[+] Processing /opt/malware/droppers/upatre/Transfer_blocked.exe ...
[-] Extracting Strings: /opt/malware/droppers/upatre/Transfer_blocked.exe
[-] Extracting OpCodes: /opt/malware/droppers/upatre/Transfer_blocked.exe
[+] Generating statistical data ...
[+] Generating Super Rules ... (a lot of foo magic)
[+] Generating simple rules ...
[-] Applying intelligent filters to string findings ...
[+] Generating super rules ...
[=] Generated 0 SIMPLE rules.
[=] Generated 0 SUPER rules.
[=] All rules written to upatre.yar
sean@sandbox:~/yarGen$ cat upatre.yar
/*
Yara Rule Set
Author: Sean Whalen
Date: 2015-08-12
Identifier: upatre
*/

/* Global Rule -------------------------------------------------------------- */
/* Will be evaluated first, speeds up scanning process, remove at will */

global private rule gen_characteristics {
condition:
uint16(0) == 0x0000 and filesize < 201KB
}

/* Rule Set ----------------------------------------------------------------- */

/* Super Rules ------------------------------------------------------------- */

Error when running yargen


python yarGen.py -c --opcodes -i office -g '/root/Downloads/vivaldi-stable_1.12.955.36-1_i386.deb'
Traceback (most recent call last):
File "yarGen.py", line 25, in <module>
from naiveBayesClassifier import tokenizer
ImportError: No module named naiveBayesClassifier

python3 yarGen.py --update
File "yarGen.py", line 33
except Exception, e:
^
SyntaxError: invalid syntax

I am not sure what the issue is; I made sure that the naiveBayesClassifier package was installed.
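
An ImportError for a package that is "installed" is consistent with the package having been installed for a different interpreter than the one running the script. A quick way to check this is to ask the interpreter itself. The sketch below needs Python 3 and should be run with the same interpreter used for yarGen; `missing_modules` is a hypothetical helper, and the module list is taken from the pip commands quoted in these reports.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("missing:", missing_modules(["scandir", "lxml", "naiveBayesClassifier", "pefile"]))
```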

why use imphash for elf file

imphash is an effective way to identify binary files. In yarGen, I see an imphash generated for PE files. But why does it generate an imphash for ELF files?
And how can an imphash be generated for ELF files in Python?

running on linux

Hi Florian,

I gave yarGen a try on my Ubuntu machine and it worked well, but I have two things to raise:

  1. In the installation guide in your README you suggest installing pickle. In my case no such package could be found, but yarGen worked without it.

  2. I got a nice warning message suggesting I should download PEStudio to improve speed, but there is no PEStudio for Linux yet. Any suggestions?

db files recognition issues

I unzipped good_opcodes.db.zip and good-strings.db in the yarGen-master folder,
but yarGen.py can't recognize the db files, as shown in the attached screenshots.
Please help.

Error in YARA rules created

Hi, I tried to create a rule for two malware samples, and it gives me an error when I test it in Hybrid Analysis. As I am new to YARA, I could not work out the cause of the error, even after reading the YARA documentation.

The YARA rule is pasted at this link:

https://pastebin.com/UTwFdCSF

Can you help me with an error-free version of the same rule, so that I can learn from the errors? Thanks

use for apk

I would like to ask: can this tool be used for Android APK files?

Dropzone: specific filename for every YARA rule

Hi there!

I am new to yarGen, so sorry for the potentially noobish question.

Is it possible to specify a different YARA rule output file for each input file?

I am trying to build automated YARA rule creation with a dropzone (files are dropped there at random), but the file yargen_rules.yar is overwritten after each file is processed.

Or is it designed so that you drop one file at a time and collect the rule? I could then build some queueing around it.

Thanks in advance.
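
A minimal sketch of the queueing idea, assuming the `-m` (malware directory) and `-o` (output file) options seen in other yarGen.py invocations. The helper `build_job`, the staging layout and the directory names are hypothetical. Each dropped file is moved into its own directory so that every run writes a separately named rule file:

```python
import os
import shutil
import sys


def build_job(sample_path, staging_root, rules_dir):
    """Stage one sample in its own folder and return the yarGen command for it."""
    file_name = os.path.basename(sample_path)
    name = os.path.splitext(file_name)[0]
    sample_dir = os.path.join(staging_root, name)
    os.makedirs(sample_dir, exist_ok=True)
    os.makedirs(rules_dir, exist_ok=True)
    shutil.move(sample_path, os.path.join(sample_dir, file_name))
    output_file = os.path.join(rules_dir, name + ".yar")
    # Pass this list to subprocess.run(...) from the yarGen directory to execute it.
    return [sys.executable, "yarGen.py", "-m", sample_dir, "-o", output_file]
```

Returning the command instead of running it keeps the staging step separate from execution, so jobs can be queued and run one at a time.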

Superrule generation logic

Hi,
Is it possible that the super-rule generation condition is inverted?
Inspecting the code, there's a condition that in practice (unless there are at least 20 strings, the default) stops the code from generating a super rule (...if len(combinations[combi]["strings"]) >= int(args.rc):..).
Inverting the condition allows the script to generate super rules. It seems to me that args.rc, as described by the documentation, shouldn't be playing that role in that condition.
I agree that simply inverting the condition might be a simplistic approach. I have only just landed in the code and did not test it thoroughly, so I am quite likely missing something.
But the reality is that the script does not generate super rules even in cases where it should (I don't expect a super rule covering dozens of similar samples to have more than 20 strings).
And as I stated before, it seems args.rc is being used with two different meanings: a maximum number of strings for simple rules, and (sort of) a minimum for super rules.
Please forgive me if I made a basic mistake (I'm not proficient in Python).
Thanks in advance for your help.
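
To make the reported behavior concrete, here is a simplified stand-in for the quoted check. It is not the actual yarGen code: `combinations`, the sample data and the helper `eligible_combinations` are hypothetical, and the default of 20 is taken from the report above. With a threshold of 20, a combination sharing only a handful of strings never qualifies:

```python
def eligible_combinations(combinations, rc=20):
    """Return the keys whose shared-string count passes the quoted '>=' check."""
    return [
        key
        for key, combo in combinations.items()
        if len(combo["strings"]) >= int(rc)
    ]


combinations = {
    "a.exe:b.exe:c.exe": {"strings": ["abrakadabra", "kadabra", "abra"]},
    "d.exe:e.exe": {"strings": ["s%d" % i for i in range(25)]},
}

print(eligible_combinations(combinations))        # only the 25-string combination
print(eligible_combinations(combinations, rc=3))  # both combinations
```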
