inquest / threatingestor

Extract and aggregate threat intelligence.

Home Page: https://inquest.readthedocs.io/projects/threatingestor/

License: GNU General Public License v2.0

ioc indicators-of-compromise threatintel threat-intelligence osint dfir malware-research security-tools threat-sharing threat-feeds

threatingestor's Introduction

ThreatIngestor

An extendable tool to extract and aggregate IOCs from threat feeds.

Integrates out-of-the-box with ThreatKB and MISP, and can fit seamlessly into any existing workflow with SQS, Beanstalk, and custom plugins.

Currently used by InQuest Labs IOC-DB: https://labs.inquest.net/iocdb

Overview

ThreatIngestor can be configured to watch Twitter, RSS feeds, sitemap (XML) feeds, or other sources, and extract meaningful information such as malicious IPs/domains and YARA signatures, and send that information to another system for analysis.

Try it out now with this quick walkthrough, read more ThreatIngestor walkthroughs on the InQuest blog, and check out labs.inquest.net/iocdb, an IOC aggregation and querying tool powered by ThreatIngestor.

Installation

ThreatIngestor requires Python 3.6+, with development headers.

Install ThreatIngestor from PyPI:

pip install threatingestor

Install optional dependencies for using some plugins, as needed:

pip install threatingestor[all]

View the full installation instructions for more information.

Usage

Create a new config.yml file, and configure each source and operator module you want to use. (See config.example.yml for layout.) Then run the script:

threatingestor config.yml

By default, it will run forever, polling each configured source every 15 minutes.

If you'd like to run the image extraction source, or include the image extraction functionality for other sources, you will need Python 3.7 or later, because the following dependencies require it:

pip install opencv-python pytesseract numpy

View the full ThreatIngestor documentation for more information.

Plugins

ThreatIngestor uses a plugin architecture with "source" (input) and "operator" (output) plugins. The currently supported integrations are:

Sources

Operators

View the full ThreatIngestor documentation for more information on included plugins, and how to create your own.

Threat Intel Sources

Looking for some threat intel sources to get started? InQuest has a Twitter List with several accounts that post C2 domains and IPs: https://twitter.com/InQuest/lists/ioc-feed. Note that you will need to apply for a Twitter developer account to use the ThreatIngestor Twitter Source. Take a look at config.example.yml to see how to set this list up as a source.

For quicker setup, RSS feeds can be a great source of intelligence. Check out this example RSS config file for a few pre-configured security blogs.

Support

If you need help getting set up, or run into any issues, feel free to open an issue. You can also reach out to @InQuest on Twitter or read more about us on the web at https://www.inquest.net.

We'd love to hear any feedback you have on ThreatIngestor, its documentation, or how you're putting it to work for you!

Contributing

Issues and pull requests are welcomed. Please keep Python code PEP8 compliant. By submitting a pull request you agree to release your submissions under the terms of the LICENSE.

Docker

Production

A Dockerfile is available for running ThreatIngestor within a Docker container.

First, you'll need to build the container:

docker build . -t threatingestor

After that, you can start the container with a bind mount of the host filesystem by using this command:

docker run -it --mount type=bind,source=/,target=/dock threatingestor /bin/bash

Once the container is running and you're inside the /bin/bash shell, you can run ThreatIngestor as usual:

threatingestor config.yml

Development

There is also a Dockerfile.dev for building a development version of ThreatIngestor. All you need is an available .whl file, which can be generated with the following command:

python3 -m build

After you've built the project, you can build the container:

docker build . -t threatingestor -f Dockerfile.dev

NOTE: If you run into any issues while building the development environment or running ThreatIngestor within the container, you may need to comment out the following lines in Dockerfile.dev for it to work properly:

FROM ubuntu:18.04
...
# RUN apt-get install tesseract-ocr -y
...
# RUN pip3 install opencv-python pytesseract numpy
...

Extra Scripts

Some scripts are now provided to help with your local configuration of ThreatIngestor.

A README.md with additional information is available here.


threatingestor's Issues

Add mentions support to Twitter source

Twitter source needs the ability to read from Twitter mentions API for the authenticated user.

Add STIX/TAXII input/output

Create a queue-based source/operator that can be used as a simpler alternative to SQS

Like SQS, but for named pipes.

This just makes it easier for people to get set up without needing an AWS account and SQS queues already created.

~~Base off of #23.~~ Rebase on top of #23 once that issue is done. I think this one is an easier starting point so we don't have to worry about SQS.

Basic Design Proposal

Source

New FIFO source with one required param: filename.
Each line read from the pipe is considered one "job".
Each job is parsed as JSON with escaped newlines.
Until #23 is done, each job should have 1 required key, content.
Treat content values as plain text and extract artifacts as normal.
Saved state will always be None, since FIFOs maintain their own state.

Operator

New FIFO operator with one required param: filename, and N optional dynamic params.
Copy logic for dynamic JSON creation from SQS operator.
Encode contents to escape newlines, so each job is always only 1 line.
Write each job to the pipe.
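
The line-per-job encoding proposed above can be sketched as follows. This is only an illustration under stated assumptions: encode_job and read_jobs are hypothetical helper names, not part of ThreatIngestor, and an in-memory stream stands in for the named pipe.

```python
import io
import json


def encode_job(**fields):
    """Serialize one job as a single line; json.dumps escapes embedded newlines."""
    return json.dumps(fields) + "\n"


def read_jobs(stream):
    """Yield one dict per line, skipping blank lines and malformed jobs."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        try:
            job = json.loads(line)
        except ValueError:
            continue  # best effort: ignore lines that are not valid JSON
        # The proposal requires a "content" key on every job.
        if isinstance(job, dict) and "content" in job:
            yield job


# In-memory stand-in for the FIFO (assumption: a real pipe would be opened by filename).
pipe = io.StringIO()
pipe.write(encode_job(content="C2 at 198.51.100.7\nsecond line", reference="example"))
pipe.write("not json\n")
pipe.seek(0)
jobs = list(read_jobs(pipe))
```

Because the newline inside content is escaped by the JSON encoder, each job occupies exactly one line, which is what lets the reader treat every line as a complete job.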

Give a better error message when Twitter operator status is invalid

Check if it's a string at init, and raise an IngestorError with a readable message if it's not.

Add Git source plugin

Add a Git source plugin, for pulling down repositories and checking the new files for YARA rules.

Write subprocess-git functions for the git commands we'll need
Use the hash of HEAD for the saved_state
Get new files with diff --name-only -- $saved_state
Any new files that match .{yar,yara,rule,rules}, run through the YARA regex, and extract individual rules
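
A rough sketch of the subprocess-based helpers described in the steps above. The function names are hypothetical, and the diff is written here as "diff --name-only <saved_state> HEAD", which is an assumption about how the saved hash would be compared against the current HEAD.

```python
import re
import subprocess

# File extensions from the proposal: .yar, .yara, .rule, .rules
RULE_FILE = re.compile(r"\.(yar|yara|rule|rules)$", re.IGNORECASE)


def git(repo_path, *args):
    """Run a git command inside repo_path and return its stdout as text."""
    result = subprocess.run(
        ["git", "-C", repo_path] + list(args),
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout


def head_hash(repo_path):
    """Hash of HEAD, suitable for storing as saved_state."""
    return git(repo_path, "rev-parse", "HEAD").strip()


def changed_files(repo_path, saved_state):
    """Names of files that differ between the saved commit and HEAD."""
    return git(repo_path, "diff", "--name-only", saved_state, "HEAD").splitlines()


def rule_files(filenames):
    """Keep only the files whose extension suggests they contain YARA rules."""
    return [name for name in filenames if RULE_FILE.search(name)]
```

The matching files would then be read and passed through the existing YARA regex; that step is left out here because it depends on ThreatIngestor internals.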

Github source doesn't iterate through results

The total count at the top of the response is not the number of entries actually returned, because results are paginated. Adding the parameter per_page=100 requests the maximum page size, and page= steps through all of the results. At minimum we should fetch a full page; whether we really want every result is an open question when the query is vague.
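
One way to walk the pages is sketched below. fetch_page is a hypothetical callable standing in for the actual HTTP request, and max_pages is an assumed safety cap for vague queries; neither comes from the plugin itself.

```python
def fetch_all(fetch_page, per_page=100, max_pages=10):
    """Collect results page by page until a short page or the page cap is hit."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page=page, per_page=per_page)
        items.extend(batch)
        if len(batch) < per_page:
            break  # a short (or empty) page means there is nothing left
    return items


def fake_fetch(page, per_page):
    """Stand-in for the search request: serves 250 fake results."""
    data = list(range(250))
    start = (page - 1) * per_page
    return data[start:start + per_page]


all_results = fetch_all(fake_fetch)
capped_results = fetch_all(fake_fetch, max_pages=1)
```

Capping the number of pages keeps a broad query from pulling in every result while still returning at least one full page.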

Add abstract JSON source/operator

Design and implement a source
Write an operator using the existing interpolation code from SQS
Rebase SQS source to extend from this.
Rebase SQS operator to extend from this.

Publish ThreatKB library and add as a dependency

Publish the "threatkb" Python API as a library on PyPI, and add it as a dependency to requirements.txt here. This is a hard requirement for initial release. Be sure to update the threatkb library and the ThreatKB operator here to support API-key auth instead of user-password.

Clarify Twitter Credentials

The Twitter credentials in config.yml and the documentation could be clearer:

Instead of the current

  - name: twitter-auth
    # https://dev.twitter.com/oauth/overview/application-owner-access-tokens
    token:
    token_key:
    con_secret_key:
    con_secret:

This would be clearer, as it uses the same terms and key order that Twitter presents:

  - name: twitter-auth
    # https://dev.twitter.com/oauth/overview/application-owner-access-tokens
    api_key:
    api_key_secret:
    access_token:
    access_token_secret:

Use sqlite for saved_state

Writing the config every time we update the state is messy. With #39, this becomes even worse, since when we write out the config from the internal dict, it's not guaranteed to be in the same order we read it in.

As a cleaner implementation, change to reading/writing config from a sqlite database instead.

Basic design proposal:

Config support for a new general variable, state_file, relative or absolute path to sqlite db.
State class with get(name) and save(name).
SQLite db with a table with name, state columns.
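
A minimal sketch of the State class outlined above, using the standard sqlite3 module. The table name "states" is an assumption, and save takes the state value as a second argument here, which the proposal does not spell out.

```python
import sqlite3


class State:
    """Tiny key/value store for per-source saved state."""

    def __init__(self, state_file):
        # state_file may be a relative or absolute path, or ":memory:" for testing.
        self.conn = sqlite3.connect(state_file)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS states (name TEXT PRIMARY KEY, state TEXT)"
        )
        self.conn.commit()

    def get(self, name):
        """Return the saved state for a source, or None if nothing is stored."""
        row = self.conn.execute(
            "SELECT state FROM states WHERE name = ?", (name,)
        ).fetchone()
        return row[0] if row else None

    def save(self, name, state):
        """Insert or overwrite the saved state for a source."""
        self.conn.execute(
            "INSERT OR REPLACE INTO states (name, state) VALUES (?, ?)",
            (name, state),
        )
        self.conn.commit()


state = State(":memory:")
state.save("twitter-inquest-c2-list", "984080845056172034")
```

Keeping state in its own database means the config file is only ever read, so its key order and comments are never rewritten.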

Extend SQS operator to support more artifact types

Add support for all existing artifact types.

Handle 'git' dependency

Error out the Git source in a graceful way when git isn't installed, and document the optional dependency in the installation docs.

Improve observability

Need to be able to see what's going on internally, in order to improve results, track bugs more easily, and provide summary reporting.

Look into adding a Keyword artifact

For config-based keyword/regex extraction on existing sources.

Add Task artifact type

Need a new artifact type for use with ThreatKB. Tasks would include things that can't be treated as a YARA rule or C2 hit - like manually investigating a CVE or an interesting blog post.

~~#12~~
~~Add Task support to ThreatKB library~~
Add Task artifact class
Add handle_task method to ThreatKB operator
Add Task as a supported artifact to applicable Sources

Required for #4

Exclude reserved IPs from artifact results

Check is_reserved on the ipaddress object.
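
A small sketch of that check with the standard ipaddress module; is_reportable is a hypothetical helper name.

```python
import ipaddress


def is_reportable(ip_string):
    """Return False for strings that are not IPs or that fall in reserved ranges."""
    try:
        ip = ipaddress.ip_address(ip_string)
    except ValueError:
        return False
    return not ip.is_reserved


reportable = [ip for ip in ["8.8.8.8", "240.0.0.1", "not-an-ip"] if is_reportable(ip)]
```

Note that is_reserved only covers IETF-reserved ranges. Private, loopback, and link-local addresses are exposed through separate properties (is_private, is_loopback, is_link_local), so excluding those as well would need additional checks.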

Add MISP operator

Export artifacts to a MISP instance, using PyMISP.

Add single-table MySQL operator

Use this https://pypi.org/project/PyMySQL/ so we don't have to handle extra system dependencies.

Use a single table with an artifact_type column because that means the table can be defined in the config.

Add 'extras' folder

Add SQS workers, web UI, and anything else that doesn't fit into the plugin architecture.

Document everything in here in a new section in the docs.

Ideas:

File watcher (#34) queue worker
Tiny web UI
Link download queue worker

Fix URL "defanging" for encoded URLs

Currently, URLs that iocextract returns in hex-encoded, URL-encoded, or similar forms are not being "refanged" (decoded) correctly. We need to handle these properly.

Allow operator artifact_types to be user-configured

Instead of hardcoding artifact_types in each operator, pass this through from the configuration files and just use the original list as a default if nothing is provided.

Add artifact_types as a global const in config.py
Parse as a comma-separated list with stripped whitespace: IPAddress, URL, YARASignature
Pass in to operators as a list of Artifact classes
If the optarg artifact_types received by an operator is None, fall back to the default hardcoded list in that operator
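
The parsing and fallback steps above might look roughly like this. The artifact classes are empty stand-ins, and ExampleOperator with its default list is hypothetical; only the comma-separated format and the None fallback come from the proposal.

```python
class Artifact:
    """Stand-in base class for illustration only."""


class IPAddress(Artifact):
    pass


class URL(Artifact):
    pass


class YARASignature(Artifact):
    pass


# Global lookup from config names to artifact classes.
ARTIFACT_TYPES = {cls.__name__: cls for cls in (IPAddress, URL, YARASignature)}


def parse_artifact_types(value):
    """Turn 'IPAddress, URL' into [IPAddress, URL]; None stays None."""
    if value is None:
        return None
    names = [name.strip() for name in value.split(",") if name.strip()]
    unknown = [name for name in names if name not in ARTIFACT_TYPES]
    if unknown:
        raise ValueError("unknown artifact types: " + ", ".join(unknown))
    return [ARTIFACT_TYPES[name] for name in names]


class ExampleOperator:
    """Hypothetical operator showing the fallback to a hardcoded default."""

    default_artifact_types = [URL]

    def __init__(self, artifact_types=None):
        if artifact_types is None:
            artifact_types = self.default_artifact_types
        self.artifact_types = artifact_types


configured = ExampleOperator(parse_artifact_types("IPAddress, URL, YARASignature"))
defaulted = ExampleOperator(parse_artifact_types(None))
```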

Add Twitter hashtags as tags in ThreatKB

When importing a C2 hit to ThreatKB from Twitter, include hashtags as tags.

This requires the following changes:

Add tags member to Artifact model
Extract hashtags (tweet['entities']['hashtags']) from tweets in Twitter source plugin
Push artifact.tags as tags to ThreatKB from ThreatKB operator plugin
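
As an illustration of the extraction step, the sketch below pulls hashtag text out of a tweet dict. It assumes the tweet follows the Twitter v1.1 JSON layout, where each hashtag entity carries a "text" field; extract_hashtags is a hypothetical helper name.

```python
def extract_hashtags(tweet):
    """Return hashtag strings (without '#') from a tweet dict, or [] if none."""
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    return [tag["text"] for tag in hashtags if "text" in tag]


sample_tweet = {
    "text": "hxxp://example[.]com/gate.php #emotet #malware",
    "entities": {
        "hashtags": [
            {"text": "emotet", "indices": [31, 38]},
            {"text": "malware", "indices": [39, 47]},
        ]
    },
}
tags = extract_hashtags(sample_tweet)
```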

Add Hash artifact type

Extract md5/shasums as a new Artifact type.

Add Hash artifact class
- Include a method for returning the hash type based on the length
Extract hashes in sources.process_element
Add support for the new artifact to CSV operator
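
The length-based lookup mentioned above could be sketched like this. hash_type is a hypothetical function name, and the set of supported digests (MD5, SHA-1, SHA-256, SHA-512) is an assumption about which "shasums" are in scope.

```python
import string

# Hex digest lengths for common hash algorithms.
HASH_TYPES_BY_LENGTH = {32: "md5", 40: "sha1", 64: "sha256", 128: "sha512"}


def hash_type(value):
    """Guess the hash algorithm from the hex digest length, or return None."""
    value = value.strip()
    if not value or any(char not in string.hexdigits for char in value):
        return None
    return HASH_TYPES_BY_LENGTH.get(len(value))


md5_example = hash_type("d41d8cd98f00b204e9800998ecf8427e")
sha1_example = hash_type("da39a3ee5e6b4b0d3255bfef95601890afd80709")
```

Length alone cannot distinguish algorithms that share a digest size, so the result is best treated as a hint rather than a guarantee.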

Add tests for new Plugin architecture

Test the new architecture introduced in 77f6cda, including the new exception handling.

Add condition support to operators

Add some new fields to operator configuration sections to allow more flexible use of operators. This will open ThreatIngestor up to run multiple discrete tasks (e.g. send Twitter "open directory" results to a crawler, and send Twitter List c2 hits to ThreatKB) from a single instance and single config file.

Add support for the following fields:

allowed_sources: Comma-separated, whitespace-stripped, wildcard-supported list of source definitions (e.g. twitter-c2-list,rss-*). Only artifacts from these sources are sent to the operator.
artifact_conditions: Comma-separated, whitespace-stripped list of predefined conditions (e.g. disallow_ip that would only let through URL artifacts if they use a FQDN and not an IP address).

Additionally:

Add a new Conditions class, all conditions will have a function that can be passed an Artifact and return True or False. If True, the artifact will be processed, otherwise it will be skipped.
Document how to create Conditions classes to extend the tool, similar to how Source and Operator modules are described in the README.
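
A rough sketch of how the two fields might be evaluated. It works on plain strings rather than Artifact objects to stay self-contained, and the function names, as well as the choice to allow every source when allowed_sources is unset, are assumptions.

```python
import fnmatch
import ipaddress
from urllib.parse import urlsplit


def parse_list(value):
    """Split a comma-separated option into stripped, non-empty items."""
    if not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


def source_allowed(source_name, allowed_sources):
    """Match a source name against wildcard patterns such as 'rss-*'."""
    patterns = parse_list(allowed_sources)
    if not patterns:
        return True  # assumption: no restriction configured means allow all
    return any(fnmatch.fnmatchcase(source_name, pattern) for pattern in patterns)


def disallow_ip(url):
    """Condition: True if the URL host is a name, False if it is an IP address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return True
    return False


urls = ["http://example.com/a", "http://198.51.100.7/a"]
passed = [url for url in urls if disallow_ip(url)]
```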

Figure out how to handle twitter links

There's not a good way right now to handle twitter t.co links vs defanged links. Decide on the best way to do this, either on the twitter end or via operator filters.

Fix Python3 compat issues

The Artifact classes use some magic methods that break on Python 3. Reimplement them to allow Python 3 support.

Handle malformed input errors better

There are some situations where unexpected input (like weirdly formed RSS content) causes the program to crash.

We should just do best-effort parsing in these cases. Unexpected input should never cause an uncaught exception; same with network errors/etc. If an input is bad, we can just log something and skip it.

This is a requirement for initial release.

Publish "crawlerlib" and add as a dependency

Need the IOC extraction portions of "crawlerlib" published, preferably as a library on PyPI.

We currently use the following features:

Have discussed possibly publishing an ioc extractor library with these pieces, and moving some of the pre/postprocessing implemented elsewhere in ThreatIngestor into that library as well.

Once this library is published, add it as a dependency to requirements.txt.

This is a hard dependency for initial release, as ThreatIngestor requires the IOC extraction to function.

Add generic WebAPI operator (webhooks)

Use this for testing: https://yeti-platform.readthedocs.io/en/latest/api.html

Want an operator that can post arbitrary json to arbitrary endpoints, so people can add new integrations without writing any code.

Base off of #23.

Add Sample artifact type

Add a new Sample artifact type to hold links to downloadable samples from app.any.run, hybrid-analysis, etc.

Add Sample artifact class
- Extend from URL class
- Fields: artifact will be the original extracted URL
- Str method: modify the original url as needed to produce direct download URL
Extract Sample artifacts in sources.process_element
- Include only nonobfuscated urls from the sample domain whitelist
Add support for new artifact to SQS and CSV operators

Add filesystem source

Point it at a file/directory and it will recursively read and extract artifacts similar to iocextract CLI.

Add Twitter operator plugin

Needs some level of customization. Hashtags per-artifact-type, additional text/hashtag, reference link, etc. Ability to quote-tweet reference tweets.

Fix operator plugin name handling

The operator plugins are getting a name keyword in their kwargs that shouldn't be there, likely as a result of the #39 changes. Figure out where this broke, and fix it.

Add Web source plugin

Add Web source plugin, for fetching plaintext/csv/etc threatfeeds and automatically extracting artifacts.

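Use HTTP conditional requests to cut down on duplicates: keep the Last-Modified and ETag values as saved_state, send them back as If-Modified-Since and If-None-Match, and skip processing when the server answers 304.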
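
A sketch of the bookkeeping this implies, kept free of network calls. The saved state is modelled as a dict with "etag" and "last_modified" keys, which is an assumption; the header names themselves are standard HTTP.

```python
def conditional_headers(saved_state):
    """Build request headers from the validators stored on the previous run."""
    headers = {}
    if not saved_state:
        return headers
    if saved_state.get("etag"):
        headers["If-None-Match"] = saved_state["etag"]
    if saved_state.get("last_modified"):
        headers["If-Modified-Since"] = saved_state["last_modified"]
    return headers


def next_state(status_code, response_headers, saved_state):
    """Keep the old state on 304; otherwise remember the new validators."""
    if status_code == 304:
        return saved_state
    return {
        "etag": response_headers.get("ETag"),
        "last_modified": response_headers.get("Last-Modified"),
    }


first = next_state(200, {"ETag": '"abc"', "Last-Modified": "Mon, 01 Jan 2018 00:00:00 GMT"}, None)
second_request_headers = conditional_headers(first)
unchanged = next_state(304, {}, first)
```

When the server replies 304, the source can return no artifacts for that run and leave the saved state untouched.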

Rework SQS source

Right now all it can do is read a link, fetch its contents, and extract IOCs. Need to extend it to accept more reasonable inputs. Keep in mind we'll be using this as a core piece of the pipeline models.

Base off of #23.

Remove all Python 2 support

Too many Unicode issues. Remove all compatibility code for 2.7, and only support Python 3.6+.

Fix all code style issues.

Add SQLite operator

Store C2s in a SQLite database, for an easy, no-setup operator that's more convenient to actually use than CSV.

Proposed database layout

One table per artifact type: domain, hash, ipaddress, url, yarasignature, task.

Each table's schema can be the same:

artifact: text primary key
reference_link: text
reference_text: text
created_date: text (filled by datetime('now', 'utc'))
state: text (initially null, for external use only)
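
The proposed layout can be tried out with the standard sqlite3 module, as sketched below. init_db and store are hypothetical helpers, and created_date is filled here with a column default of datetime('now'), which SQLite evaluates in UTC; that is an assumption about how the timestamp would be set.

```python
import sqlite3

TABLES = ("domain", "hash", "ipaddress", "url", "yarasignature", "task")

SCHEMA = """
CREATE TABLE IF NOT EXISTS {table} (
    artifact TEXT PRIMARY KEY,
    reference_link TEXT,
    reference_text TEXT,
    created_date TEXT DEFAULT (datetime('now')),
    state TEXT
)
"""


def init_db(path):
    """Create one table per artifact type and return the connection."""
    conn = sqlite3.connect(path)
    for table in TABLES:
        conn.execute(SCHEMA.format(table=table))
    conn.commit()
    return conn


def store(conn, table, artifact, reference_link, reference_text):
    """Insert an artifact, ignoring duplicates thanks to the primary key."""
    if table not in TABLES:
        raise ValueError("unknown artifact table: " + table)
    conn.execute(
        "INSERT OR IGNORE INTO {} (artifact, reference_link, reference_text) "
        "VALUES (?, ?, ?)".format(table),
        (artifact, reference_link, reference_text),
    )
    conn.commit()


conn = init_db(":memory:")
store(conn, "domain", "example.com", "https://example.org/post", "seen in post")
store(conn, "domain", "example.com", "https://example.org/other", "seen again")
rows = conn.execute("SELECT artifact, state, created_date FROM domain").fetchall()
```

Using the artifact itself as the primary key means repeated sightings do not create duplicate rows, at the cost of keeping only the first reference.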

Extend Twitter support to extract t.co links

Unshorten the t.co URL to get the actual URL.

In Twitter source, use the tweet object to expand t.co links before sending off to process_element
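
A minimal sketch of that expansion, assuming the tweet dict follows the Twitter v1.1 layout in which each entry of entities.urls carries both the t.co "url" and its "expanded_url". expand_tco_links is a hypothetical helper name.

```python
def expand_tco_links(tweet):
    """Replace each t.co link in the tweet text with its expanded URL."""
    text = tweet.get("text", "")
    for entity in tweet.get("entities", {}).get("urls", []):
        short = entity.get("url")
        expanded = entity.get("expanded_url")
        if short and expanded:
            text = text.replace(short, expanded)
    return text


sample_tweet = {
    "text": "New paste: https://t.co/abc123 #malware",
    "entities": {
        "urls": [
            {"url": "https://t.co/abc123", "expanded_url": "https://pastebin.com/raw/EXAMPLE"}
        ]
    },
}
expanded_text = expand_tco_links(sample_tweet)
```

The expanded text can then be passed to process_element in place of the original tweet text, so no extra HTTP request is needed to unshorten the link.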

Backfill unit tests

Write unit tests for existing code.

RSS Source

run_respects_saved_state
run_does_preprocessing_deobfuscation
run_respects_feed_type
run_supports_both_content_summary
run_supports_both_link_url
run_returns_top_item_date_as_saved_state
run_returns_artifacts_correctly

Twitter Source

init_detects_search_type
run_respects_saved_state
run_returns_newest_tweet_id_as_saved_state
run_supports_all_endpoints
run_returns_artifacts_correctly

ThreatKB Operator

handle_domain_creates_domain
handle_ipaddress_creates_ipaddress
handle_yarasignature_creates_yarasignature

SQS Operator

handle_url_discards_ip_urls
handle_url_passes_kwargs

Config

daemon_returns_main_daemon_bool
sleep_returns_main_sleep_int
sources_returns_list_of_all_source_tuples
sources_excludes_internal_options_from_kwargs
operators_returns_list_of_all_operator_tuples
operators_excludes_internal_options_from_kwargs
save_state_writes_saved_state
get_state_reads_saved_state

Source base class

Operator base class

default_artifact_types_is_empty
handle_artifact_raises_not_implemented
process_includes_only_artifact_types

Ingestor class

init_creates_sources_operators_dicts
run_checks_config_daemon
run_once_calls_run_process_save_state

Add defanged URL/domain/IP interpolation support to artifacts.

Depends on InQuest/iocextract#14.

Use YAML for config instead of INI

Solves both the credential duplication issue and the lack of support for multiple sets of credentials. Cleaner, and still human readable and writable.

Depends on #44.

Final Design

general:
    sleep: 1500
    daemon: true

credentials:
  - name: twitter-myuser
    token: EXAMPLE
    token_key: EXAMPLE
    con_secret_key: EXAMPLE
    con_secret: EXAMPLE

sources:
  - name: twitter-open-directory
    credentials: twitter-myuser
    module: twitter
    saved_state: 
    q: '"open directory" #malware'

  - name: twitter-inquest-c2-list
    credentials: twitter-myuser
    module: twitter
    saved_state: 
    owner_screen_name: InQuest
    slug: c2-feed

operators:
  - name: mycsv
    module: csv
    filename: output.csv
    allowed_sources: [mysource, myothersource]
    filter: '([^\.]google.com$|google.com[^/])'
    artifact_types: [URL, Domain]
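
To show how the named credentials in the design above could be merged into each source, here is a sketch that operates on the already-parsed config as plain dicts and lists. resolve_sources is a hypothetical function, and the decision to return (name, module, kwargs) tuples is an assumption.

```python
def resolve_sources(config):
    """Return (name, module, kwargs) tuples with credentials merged into kwargs."""
    credentials = {
        cred["name"]: {key: value for key, value in cred.items() if key != "name"}
        for cred in config.get("credentials", [])
    }
    resolved = []
    for source in config.get("sources", []):
        # Internal options are not passed through to the plugin.
        kwargs = {
            key: value
            for key, value in source.items()
            if key not in ("name", "module", "credentials")
        }
        cred_name = source.get("credentials")
        if cred_name is not None:
            # Raises KeyError if the source references an undefined credential set.
            kwargs.update(credentials[cred_name])
        resolved.append((source["name"], source["module"], kwargs))
    return resolved


config = {
    "credentials": [{"name": "twitter-myuser", "token": "EXAMPLE", "token_key": "EXAMPLE"}],
    "sources": [
        {
            "name": "twitter-open-directory",
            "credentials": "twitter-myuser",
            "module": "twitter",
            "saved_state": None,
            "q": '"open directory" #malware',
        }
    ],
}
sources = resolve_sources(config)
```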

Old brainstorming info

Currently, config looks like this:

[source:twitter-inquest-c2-list]
module = twitter
saved_state = 
token = EXAMPLE
token_key = EXAMPLE
con_secret_key = EXAMPLE
con_secret = EXAMPLE
owner_screen_name = InQuest
slug = c2-feed

[source:twitter-open-directory]
module = twitter
saved_state = 
token = EXAMPLE
token_key = EXAMPLE
con_secret_key = EXAMPLE
con_secret = EXAMPLE
q = "open directory" #malware

When adding multiple plugins with the same module, you have to specify credentials each time, which is a hassle. I'd like to get rid of that requirement, and let people set up something that looks more like this:

twitter-myuser:
    module: twitter
    token: EXAMPLE
    token_key: EXAMPLE
    con_secret_key: EXAMPLE
    con_secret: EXAMPLE

    sources:
        twitter-open-directory:
            saved_state: 
            q: '"open directory" #malware'

        twitter-inquest-c2-list:
            saved_state: 
            owner_screen_name: InQuest
            slug: c2-feed

In this case, you only have to define the credentials once per account, and you can still use multiple Twitter accounts if desired, by creating a second base object, e.g.:

twitter-myotheruser:
    module: twitter
    token: EXAMPLE

...

I'm not sure how clean the implementation would be for this, so we should think about and map out a complete design for this to see if there's a better way to do it. For example, it might be better to have separate base sections for credentials, sources, and operators, and just include a reference to a certain named credential in each source or operator, something like this:

credentials:
    twitter-myuser:
        token: EXAMPLE
        token_key: EXAMPLE
        con_secret_key: EXAMPLE
        con_secret: EXAMPLE

sources:
    twitter-open-directory:
        credentials: twitter-myuser
        module: twitter
        saved_state: 
        q: '"open directory" #malware'

    twitter-inquest-c2-list:
        credentials: twitter-myuser
        module: twitter
        saved_state: 
        owner_screen_name: InQuest
        slug: c2-feed

Or to abstract it even further, and allow including any section in any other section:

twitter-myuser:
    module: twitter
    token: EXAMPLE
    token_key: EXAMPLE
    con_secret_key: EXAMPLE
    con_secret: EXAMPLE

source-twitter-open-directory:
    include: twitter-myuser
    saved_state: 
    q: '"open directory" #malware'

source-twitter-inquest-c2-list:
    include: twitter-myuser
    saved_state: 
    owner_screen_name: InQuest
    slug: c2-feed

This is the most flexible design, but I'm not sure whether I prefer having a defined source section, vs having the plugin type be based off the name (source-*, operator-*), vs having a key that defines the type (type: source). We need some way to differentiate sources and operators, since there can be duplication between them (e.g. SQS is both a source and an operator).

Moved from comment below.

Looking at some example YAML configs in other tools (Kubernetes, Ansible, etc), it seems having a list of items, each with a name key, is a common pattern. This makes me lean towards the second proposal above, with a slight modification:

sources:
  - name: twitter-open-directory
    credentials: twitter-myuser
    module: twitter
    saved_state: 
    q: '"open directory" #malware'

  - name: twitter-inquest-c2-list
    credentials: twitter-myuser
    module: twitter
    saved_state: 
    owner_screen_name: InQuest
    slug: c2-feed

I'm still not sure how to best handle credentials. The third proposal's "include" solution is more flexible, but I can't think of a case where you'd want to reuse anything that wasn't credentials. Having creds in their own named "credentials" section seems clearer for the end user, but doesn't accurately represent how the config parsing would be implemented, as you could feasibly define any parameters in the "credentials" section and reuse them for some other purpose.

Improve documentation

Add Pastebin source

Add support for additional deobfuscation techniques

Continue to add and improve C2 deobfuscation to catch more cases.

"ftx:" -> "ftp:"
"http__"
http:// test .com /url

ftx://test.com/doc.doc
http__www.clowndoc.com/KNpgJS/
http__co-story.co.kr/j59x7Q6/
http__delassociates.com/vXWS9G/
http__www.bagnismeraldo.com/hsVI1/
http__mkholidays.co.uk/GDYt/
http:// peekquick .com /sdeu/cr.sedin/sdac/

https://www.cisco.com/c/en/us/support/docs/security/email-security-appliance/118775-technote-esa-00.html
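
The three patterns listed above can be handled with a few regular expressions, sketched here for a single candidate string that has already been isolated from the surrounding text. refang is a hypothetical helper, not the iocextract API, and stripping all whitespace is only safe under that single-candidate assumption.

```python
import re


def refang(candidate):
    """Undo the 'ftx:', 'http__', and spaced-out URL obfuscations."""
    url = candidate.strip()
    # "ftx:" -> "ftp:"
    url = re.sub(r"^ftx:", "ftp:", url, flags=re.IGNORECASE)
    # "http__" / "https__" -> "http://" / "https://"
    url = re.sub(r"^(https?)__", r"\1://", url, flags=re.IGNORECASE)
    # "http:// test .com /url" -> "http://test.com/url"
    url = re.sub(r"\s+", "", url)
    return url


samples = [
    "ftx://test.com/doc.doc",
    "http__mkholidays.co.uk/GDYt/",
    "http:// peekquick .com /sdeu/cr.sedin/sdac/",
]
cleaned = [refang(sample) for sample in samples]
```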

Add GitHub Search source plugin

Add GitHub Search source plugin, for checking e.g. CVE*** in newest repos. Hits would be imported as a Task artifact for use with ThreatKB.

Depends on #5

Add SQS as a Source module.

Add an SQS Source module. Will allow full-circle workflow. ThreatIngestor will classify/deobfuscate/filter input and send it to configured outputs. Doubles as SQS support for ThreatKB.

One example workflow:

Receive tweet https://twitter.com/_ddoxer/status/984080845056172034 in c2 list
Send pastebin link to SQS
SQS reader receives pastebin link, gets raw link, scrapes content
SQS reader sends content as a job with reference link of original pastebin link, to ThreatIngestor SQS Source
ThreatIngestor picks up job, processes, sends C2s to configured outputs

Update docs to reflect YAML config and statedb changes.

Update docs to reflect the new config layout from #39 and the statedb addition from #44.

Add auth to GitHub plugin

GitHub rate limits unauthenticated requests to 10/min. It's possible someone might want to search for more than 10 things, so we should build in (optional) auth support to the GitHub plugin to support that.