inquest / threatingestor

Extract and aggregate threat intelligence.

Home Page: https://inquest.readthedocs.io/projects/threatingestor/

License: GNU General Public License v2.0

Python 99.38% Dockerfile 0.23% Shell 0.39%
ioc indicators-of-compromise threatintel threat-intelligence osint dfir malware-research security-tools threat-sharing threat-feeds

threatingestor's Introduction

ThreatIngestor

An extendable tool to extract and aggregate IOCs from threat feeds.

Integrates out-of-the-box with ThreatKB and MISP, and can fit seamlessly into any existing workflow with SQS, Beanstalk, and custom plugins.

Currently used by InQuest Labs IOC-DB: https://labs.inquest.net/iocdb

Overview

ThreatIngestor can be configured to watch Twitter, RSS feeds, sitemap (XML) feeds, or other sources, extract meaningful information such as malicious IPs/domains and YARA signatures, and send that information to another system for analysis.

[Figure: ThreatIngestor flowchart with several sources feeding into multiple operators]

Try it out now with this quick walkthrough, read more ThreatIngestor walkthroughs on the InQuest blog, and check out labs.inquest.net/iocdb, an IOC aggregation and querying tool powered by ThreatIngestor.

Installation

ThreatIngestor requires Python 3.6+, with development headers.

Install ThreatIngestor from PyPI:

pip install threatingestor

Install optional dependencies for using some plugins, as needed:

pip install threatingestor[all]

View the full installation instructions for more information.

Usage

Create a new config.yml file, and configure each source and operator module you want to use. (See config.example.yml for layout.) Then run the script:

threatingestor config.yml

By default, it will run forever, polling each configured source every 15 minutes.

If you'd like to run the image extraction source, or include the image extraction functionality for other sources, you will need Python 3.7 or later because of the extra dependencies, which can be installed with:

pip install opencv-python pytesseract numpy

View the full ThreatIngestor documentation for more information.

Plugins

ThreatIngestor uses a plugin architecture with "source" (input) and "operator" (output) plugins. The currently supported integrations are:

Sources

Operators

View the full ThreatIngestor documentation for more information on included plugins, and how to create your own.

Threat Intel Sources

Looking for some threat intel sources to get started? InQuest has a Twitter List with several accounts that post C2 domains and IPs: https://twitter.com/InQuest/lists/ioc-feed. Note that you will need to apply for a Twitter developer account to use the ThreatIngestor Twitter Source. Take a look at config.example.yml to see how to set this list up as a source.

For quicker setup, RSS feeds can be a great source of intelligence. Check out this example RSS config file for a few pre-configured security blogs.

Support

If you need help getting set up, or run into any issues, feel free to open an issue. You can also reach out to @InQuest on Twitter or read more about us on the web at https://www.inquest.net.

We'd love to hear any feedback you have on ThreatIngestor, its documentation, or how you're putting it to work for you!

Contributing

Issues and pull requests are welcomed. Please keep Python code PEP8 compliant. By submitting a pull request you agree to release your submissions under the terms of the LICENSE.

Docker

Production

A Dockerfile is available for running ThreatIngestor within a Docker container.

First, you'll need to build the container:

docker build . -t threatingestor

After that, you can start the container with a bind mount by using this command:

docker run -it --mount type=bind,source=/,target=/dock threatingestor /bin/bash

Once the container is running and you're inside the /bin/bash shell, you can run ThreatIngestor like normal:

threatingestor config.yml

Development

There is also a Dockerfile.dev for building a development version of ThreatIngestor. All you need is an available .whl file, which can be generated with the following command:

python3 -m build 

After you've built the project, you can build the container:

docker build . -t threatingestor -f Dockerfile.dev

NOTE: If you run into any issues while building the development environment or running ThreatIngestor within the container, you may need to comment out the following lines in Dockerfile.dev:

FROM ubuntu:18.04
...
# RUN apt-get install tesseract-ocr -y
...
# RUN pip3 install opencv-python pytesseract numpy
...

Extra Scripts

Some scripts are now provided to help with your local configuration of ThreatIngestor.

A README.md with additional information is available here.

threatingestor's People

Contributors

battleoverflow, cmmorrow, deandrehall, elstamey, lil-bear, needmorecowbell, ninoseki, pedramamini, rshipp, shark4ce, vantagepointsecurity-danny, wikijm, willymac, ynvtlmr

threatingestor's Issues

Use sqlite for saved_state

Writing the config every time we update the state is messy. With #39, this becomes even worse, since when we write out the config from the internal dict, it's not guaranteed to be in the same order we read it in.

As a cleaner implementation, switch to reading and writing the saved state from a SQLite database instead.

Basic design proposal:

  • Config support for a new general variable, state_file, relative or absolute path to sqlite db.
  • State class with get(name) and save(name).
  • SQLite db with a table with name, state columns.
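
The proposal above could be sketched roughly as follows. This is only an illustration, not the project's implementation: the table name `states` and the extra `state` argument to `save` are assumptions added so the example is complete.

```python
import sqlite3


class State:
    """Hypothetical saved-state store backed by a SQLite table."""

    def __init__(self, state_file):
        # state_file may be a relative or absolute path (":memory:" for tests).
        self.conn = sqlite3.connect(state_file)
        # Table name "states" is an assumption; the issue only names the columns.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS states (name TEXT PRIMARY KEY, state TEXT)"
        )
        self.conn.commit()

    def get(self, name):
        """Return the saved state for a source, or None if nothing is stored."""
        row = self.conn.execute(
            "SELECT state FROM states WHERE name = ?", (name,)
        ).fetchone()
        return row[0] if row else None

    def save(self, name, state):
        """Insert or overwrite the saved state for a source."""
        self.conn.execute(
            "INSERT OR REPLACE INTO states (name, state) VALUES (?, ?)",
            (name, state),
        )
        self.conn.commit()
```

Because the state lives in its own file, the config file never has to be rewritten, which sidesteps the key-ordering problem described above.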

Add auth to GitHub plugin

GitHub rate limits unauthenticated requests to 10/min. It's possible someone might want to search for more than 10 things, so we should build in (optional) auth support to the GitHub plugin to support that.
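
A minimal sketch of optional auth, assuming the plugin sends plain HTTP requests and that the token comes from an optional config value. The helper name `github_headers` is hypothetical.

```python
def github_headers(token=None):
    """Build request headers, adding auth only when a token is configured."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # GitHub's REST API accepts a personal access token in this header.
        headers["Authorization"] = f"token {token}"
    return headers
```

Leaving `token` unset keeps the current unauthenticated behaviour, so existing configs would continue to work.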

Handle 'git' dependency

Error out the Git source in a graceful way when git isn't installed, and document the optional dependency in the installation docs.
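
One way to fail gracefully is to check for the executable before the source runs. The sketch below uses `shutil.which` from the standard library; the exception class and function name are hypothetical.

```python
import shutil


class GitNotInstalledError(RuntimeError):
    """Hypothetical error raised when the optional git dependency is missing."""


def require_git(which=shutil.which):
    """Return the path to git, or raise a clear error if it is not installed."""
    path = which("git")
    if path is None:
        raise GitNotInstalledError(
            "The Git source requires the 'git' executable; "
            "install it or remove the Git source from your config."
        )
    return path
```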

Add Web source plugin

Add Web source plugin, for fetching plaintext/csv/etc threatfeeds and automatically extracting artifacts.

  • Use HTTP 304 as saved_state to cut down on duplicates (Last-Modified/If-Modified-Since/ETag/If-None-Match)
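
The conditional-request idea could look something like this. It is a sketch under assumptions: the saved state is modelled as a small dict with `etag` and `last_modified` keys, and both function names are hypothetical.

```python
def conditional_headers(saved_state):
    """Build If-None-Match / If-Modified-Since headers from the saved state."""
    headers = {}
    if saved_state:
        if saved_state.get("etag"):
            headers["If-None-Match"] = saved_state["etag"]
        if saved_state.get("last_modified"):
            headers["If-Modified-Since"] = saved_state["last_modified"]
    return headers


def next_state(status_code, response_headers, saved_state):
    """Return (new_saved_state, changed) for a response.

    A 304 means the feed is unchanged, so the old state is kept and the
    body can be skipped, which avoids re-extracting duplicate artifacts.
    """
    if status_code == 304:
        return saved_state, False
    new_state = {
        "etag": response_headers.get("ETag"),
        "last_modified": response_headers.get("Last-Modified"),
    }
    return new_state, True
```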

Fix operator plugin name handling

The operator plugins are getting a name keyword in their kwargs that shouldn't be there, likely as a result of the #39 changes. Figure out where this broke, and fix it.

Allow operator artifact_types to be user-configured

Instead of hardcoding artifact_types in each operator, pass this through from the configuration files and just use the original list as a default if nothing is provided.

  • Add artifact_types as a global const in config.py
  • Parse as a comma-separated list with stripped whitespace: IPAddress, URL, YARASignature
  • Pass in to operators as a list of Artifact classes
  • If the optarg artifact_types received by an operator is None, fall back to the default hardcoded list in that operator
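
The parsing step might be sketched like this. The artifact classes here are empty stand-ins and the function name `parse_artifact_types` is hypothetical; only the comma-separated format and the fall-back behaviour come from the list above.

```python
class IPAddress:
    """Stand-in for the real artifact class."""


class URL:
    """Stand-in for the real artifact class."""


class YARASignature:
    """Stand-in for the real artifact class."""


# Lookup from the name used in the config to the artifact class.
ARTIFACT_TYPES = {cls.__name__: cls for cls in (IPAddress, URL, YARASignature)}


def parse_artifact_types(value, default):
    """Turn 'IPAddress, URL' into [IPAddress, URL]; use default if unset."""
    if value is None:
        return list(default)
    names = [name.strip() for name in value.split(",") if name.strip()]
    # An unknown name raises KeyError, surfacing config typos early.
    return [ARTIFACT_TYPES[name] for name in names]
```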

Add SQLite operator

Store C2s in a SQLite database, for an easy, no-setup operator that's more convenient to actually use than CSV.

Proposed database layout

One table per artifact type: domain, hash, ipaddress, url, yarasignature, task.

Each table's schema can be the same:

  • artifact: text primary key
  • reference_link: text
  • reference_text: text
  • created_date: text (filled by datetime('now', 'utc'))
  • state: text (initially null, for external use only)
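
As a rough illustration of the layout above (not the operator that was eventually written), the tables could be created and filled like this. The function names are hypothetical, and skipping duplicate artifacts with `INSERT OR IGNORE` is an assumption about the desired behaviour.

```python
import sqlite3

TABLES = ("domain", "hash", "ipaddress", "url", "yarasignature", "task")


def create_tables(conn):
    """Create one table per artifact type, all sharing the same schema."""
    for table in TABLES:
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} ("
            "artifact TEXT PRIMARY KEY, "
            "reference_link TEXT, "
            "reference_text TEXT, "
            "created_date TEXT, "
            "state TEXT)"
        )
    conn.commit()


def save_artifact(conn, table, artifact, reference_link, reference_text):
    """Store an artifact; repeats of the same artifact are ignored."""
    if table not in TABLES:
        raise ValueError(f"unknown artifact table: {table}")
    conn.execute(
        f"INSERT OR IGNORE INTO {table} "
        "(artifact, reference_link, reference_text, created_date) "
        "VALUES (?, ?, ?, datetime('now', 'utc'))",
        (artifact, reference_link, reference_text),
    )
    conn.commit()
```

The `state` column is left null on insert, matching the note that it is for external use only.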

Improve observability

Need to be able to see what's going on internally, in order to improve results, track bugs more easily, and provide summary reporting.

Add SQS as a Source module.

Add an SQS Source module. Will allow full-circle workflow. ThreatIngestor will classify/deobfuscate/filter input and send it to configured outputs. Doubles as SQS support for ThreatKB.

One example workflow:

  1. Receive tweet https://twitter.com/_ddoxer/status/984080845056172034 in c2 list
  2. Send pastebin link to SQS
  3. SQS reader receives pastebin link, gets raw link, scrapes content
  4. SQS reader sends content as a job with reference link of original pastebin link, to ThreatIngestor SQS Source
  5. ThreatIngestor picks up job, processes, sends C2s to configured outputs

Publish "crawlerlib" and add as a dependency

Need the IOC extraction portions of "crawlerlib" published, preferably as a library on PyPI.

We currently use the following features:

Have discussed possibly publishing an ioc extractor library with these pieces, and moving some of the pre/postprocessing implemented elsewhere in ThreatIngestor into that library as well.

Once this library is published, add it as a dependency to requirements.txt.

This is a hard dependency for initial release, as ThreatIngestor requires the IOC extraction to function.

Add Twitter hashtags as tags in ThreatKB

When importing a C2 hit to ThreatKB from Twitter, include hashtags as tags.

This requires the following changes:

  • Add tags member to Artifact model
  • Extract hashtags (tweet['entities']['hashtags']) from tweets in Twitter source plugin
  • Push artifact.tags as tags to ThreatKB from ThreatKB operator plugin
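
The extraction step could be as small as the following sketch. It assumes the tweet is the dict form returned by the Twitter API, where each hashtag entity carries a `text` field; the function name is hypothetical.

```python
def extract_hashtags(tweet):
    """Return hashtag strings (without '#') from a tweet dict."""
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    return [tag["text"] for tag in hashtags]
```

The returned list could then be stored on the artifact's `tags` member and passed through by the ThreatKB operator.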

Add abstract JSON source/operator

  • Design and implement a source
  • Write an operator using the existing interpolation code from SQS
  • Rebase SQS source to extend from this.
  • Rebase SQS operator to extend from this.

Use YAML for config instead of INI

Solves the auth duplication issue and the issue of allowing multiple auths. Cleaner, and still human-readable and writable.

Depends on #44.

Final Design

general:
    sleep: 1500
    daemon: true

credentials:
  - name: twitter-myuser
    token: EXAMPLE
    token_key: EXAMPLE
    con_secret_key: EXAMPLE
    con_secret: EXAMPLE

sources:
  - name: twitter-open-directory
    credentials: twitter-myuser
    module: twitter
    saved_state: 
    q: '"open directory" #malware'

  - name: twitter-inquest-c2-list
    credentials: twitter-myuser
    module: twitter
    saved_state: 
    owner_screen_name: InQuest
    slug: c2-feed

operators:
  - name: mycsv
    module: csv
    filename: output.csv
    allowed_sources: [mysource, myothersource]
    filter: '([^\.]google.com$|google.com[^/])'
    artifact_types: [URL, Domain]

Old brainstorming info

Currently, config looks like this:

[source:twitter-inquest-c2-list]
module = twitter
saved_state = 
token = EXAMPLE
token_key = EXAMPLE
con_secret_key = EXAMPLE
con_secret = EXAMPLE
owner_screen_name = InQuest
slug = c2-feed

[source:twitter-open-directory]
module = twitter
saved_state = 
token = EXAMPLE
token_key = EXAMPLE
con_secret_key = EXAMPLE
con_secret = EXAMPLE
q = "open directory" #malware

When adding multiple plugins with the same module, you have to specify credentials each time, which is a hassle. I'd like to get rid of that requirement, and let people set up something that looks more like this:

twitter-myuser:
    module: twitter
    token: EXAMPLE
    token_key: EXAMPLE
    con_secret_key: EXAMPLE
    con_secret: EXAMPLE

    sources:
        twitter-open-directory:
            saved_state: 
            q: '"open directory" #malware'

        twitter-inquest-c2-list:
            saved_state: 
            owner_screen_name: InQuest
            slug: c2-feed

In this case, you only have to define the credentials once per account, and you can still use multiple Twitter accounts if desired, by creating a second base object, e.g.:

twitter-myotheruser:
    module: twitter
    token: EXAMPLE

...

I'm not sure how clean the implementation would be for this, so we should think about and map out a complete design for this to see if there's a better way to do it. For example, it might be better to have separate base sections for credentials, sources, and operators, and just include a reference to a certain named credential in each source or operator, something like this:

credentials:
    twitter-myuser:
        token: EXAMPLE
        token_key: EXAMPLE
        con_secret_key: EXAMPLE
        con_secret: EXAMPLE

sources:
    twitter-open-directory:
        credentials: twitter-myuser
        module: twitter
        saved_state: 
        q: '"open directory" #malware'

    twitter-inquest-c2-list:
        credentials: twitter-myuser
        module: twitter
        saved_state: 
        owner_screen_name: InQuest
        slug: c2-feed

Or to abstract it even further, and allow including any section in any other section:

twitter-myuser:
    module: twitter
    token: EXAMPLE
    token_key: EXAMPLE
    con_secret_key: EXAMPLE
    con_secret: EXAMPLE

source-twitter-open-directory:
    include: twitter-myuser
    saved_state: 
    q: '"open directory" #malware'

source-twitter-inquest-c2-list:
    include: twitter-myuser
    saved_state: 
    owner_screen_name: InQuest
    slug: c2-feed

This is the most flexible design, but I'm not sure whether I prefer having a defined source section, vs having the plugin type be based off the name (source-*, operator-*), vs having a key that defines the type (type: source). We need some way to differentiate sources and operators, since there can be duplication between them (e.g. SQS is both a source and an operator).

Moved from comment below.

Looking at some example YAML configs in other tools (Kubernetes, Ansible, etc), it seems having a list of items, each with a name key, is a common pattern. This makes me lean towards the second proposal above, with a slight modification:

sources:
  - name: twitter-open-directory
    credentials: twitter-myuser
    module: twitter
    saved_state: 
    q: '"open directory" #malware'

  - name: twitter-inquest-c2-list
    credentials: twitter-myuser
    module: twitter
    saved_state: 
    owner_screen_name: InQuest
    slug: c2-feed

I'm still not sure how to best handle credentials. The third proposal's "include" solution is more flexible, but I can't think of a case where you'd want to reuse anything that wasn't credentials. Having creds in their own named "credentials" section seems clearer for the end user, but doesn't accurately represent how the config parsing would be implemented, as you could feasibly define any parameters in the "credentials" section and reuse them for some other purpose.
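
To make the named-credentials idea concrete, here is a minimal sketch of how a parser might merge a referenced credential into each source. It works on an already-loaded config dict shaped like the list-based design; the function name `resolve_sources` is hypothetical, and letting source keys override credential keys is an assumption.

```python
def resolve_sources(config):
    """Return source kwargs with any referenced credentials merged in."""
    credentials = {cred["name"]: cred for cred in config.get("credentials", [])}
    resolved = []
    for source in config.get("sources", []):
        kwargs = dict(source)
        cred_name = kwargs.pop("credentials", None)
        if cred_name is not None:
            creds = dict(credentials[cred_name])
            creds.pop("name")  # the credential's own name is not a plugin option
            # Source keys win if both define the same option.
            kwargs = {**creds, **kwargs}
        resolved.append(kwargs)
    return resolved
```

This also shows the concern raised above: nothing in the merge is specific to credentials, so any shared parameters placed in that section would be reused the same way.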

Add filesystem source

Point it at a file/directory and it will recursively read and extract artifacts similar to iocextract CLI.

Create a queue-based source/operator that can be used as a simpler alternative to SQS

Like SQS, but for named pipes.

This just makes it easier for people to get set up without needing an AWS account and SQS queues already created.

Base off of #23. Rebase on top of #23 once that issue is done. I think this one is an easier starting point so we don't have to worry about SQS.

Basic Design Proposal

Source

  • New FIFO source with one required param: filename.
  • Each line read from the pipe is considered one "job".
  • Each job is parsed as JSON with escaped newlines.
  • Until #23 is done, each job should have 1 required key, content.
  • Treat content values as plain text and extract artifacts as normal.
  • Saved state will always be None, since FIFOs maintain their own state.

Operator

  • New FIFO operator with one required param: filename, and N optional dynamic params.
  • Copy logic for dynamic JSON creation from SQS operator.
  • Encode contents to escape newlines, so each job is always only 1 line.
  • Write each job to the pipe.
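
The one-job-per-line encoding can be sketched with the standard `json` module, which escapes newlines inside strings so each job stays on a single line. The function names are hypothetical, and any file-like object stands in for the named pipe.

```python
import io
import json


def write_job(pipe, **fields):
    """Serialize one job as a single JSON line and write it to the pipe."""
    # json.dumps escapes embedded newlines, so the job never spans lines.
    pipe.write(json.dumps(fields) + "\n")


def read_jobs(pipe):
    """Yield the 'content' value of each JSON job read from the pipe."""
    for line in pipe:
        line = line.strip()
        if not line:
            continue
        job = json.loads(line)
        yield job["content"]  # 'content' is the one required key
```

For a quick check without a real FIFO, an `io.StringIO` object can be passed as `pipe`.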

Add condition support to operators

Add some new fields to operator configuration sections to allow more flexible use of operators. This will open ThreatIngestor up to run multiple discrete tasks (e.g. send Twitter "open directory" results to a crawler, and send Twitter List c2 hits to ThreatKB) from a single instance and single config file.

Add support for the following fields:

  • allowed_sources: Comma-separated, whitespace-stripped, wildcard-supported list of source definitions (e.g. twitter-c2-list,rss-*). Only artifacts from these sources are sent to the operator.
  • artifact_conditions: Comma-separated, whitespace-stripped list of predefined conditions (e.g. disallow_ip that would only let through URL artifacts if they use a FQDN and not an IP address).

Additionally:

  • Add a new Conditions class. Each condition will have a function that is passed an Artifact and returns True or False. If True, the artifact will be processed; otherwise it will be skipped.
  • Document how to create Conditions classes to extend the tool, similar to how Source and Operator modules are described in the README.
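
A rough sketch of both fields, using plain functions rather than a Conditions class to keep it short. Artifacts are represented as URL strings here, all names other than `allowed_sources`, `artifact_conditions`, and `disallow_ip` are hypothetical, and treating an empty `allowed_sources` as "allow everything" is an assumption.

```python
import ipaddress
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def parse_list(value):
    """Split a comma-separated option into whitespace-stripped items."""
    return [item.strip() for item in (value or "").split(",") if item.strip()]


def source_allowed(source_name, allowed_sources=None):
    """Check a source name against wildcard patterns like 'rss-*'."""
    patterns = parse_list(allowed_sources)
    if not patterns:
        return True  # assumption: no restriction configured
    return any(fnmatchcase(source_name, pattern) for pattern in patterns)


def disallow_ip(url):
    """Return True only if the URL's host is not a bare IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return True  # not an IP, so treat it as a domain name
    return False


CONDITIONS = {"disallow_ip": disallow_ip}


def passes_conditions(artifact, artifact_conditions=None):
    """Return True if the artifact satisfies every configured condition."""
    return all(CONDITIONS[name](artifact) for name in parse_list(artifact_conditions))
```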

Add 'extras' folder

Add SQS workers, web UI, and anything else that doesn't fit into the plugin architecture.

Document everything in here in a new section in the docs.

Ideas:

  • File watcher (#34) queue worker
  • Tiny web UI
  • Link download queue worker

Add GitHub Search source plugin

Add GitHub Search source plugin, for checking e.g. CVE*** in newest repos. Hits would be imported as a Task artifact for use with ThreatKB.

Depends on #5

Github source doesn't iterate through results

While the top of the response shows the total number of entries, that is not how many are returned, because results are paginated. Adding the parameter per_page=100 sets the maximum page size, and page= steps through all the results. At minimum we should fetch the maximum page size; however, it is an open question whether we really want all the results if the query is vague.
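
The pagination loop might look like the sketch below. To stay independent of any HTTP library, it takes a hypothetical `fetch_page(page, per_page)` callable that returns the list of items for one page; `max_pages` is an assumed safety cap for vague queries.

```python
def iter_results(fetch_page, per_page=100, max_pages=10):
    """Yield items from successive pages until they run out or the cap is hit."""
    for page in range(1, max_pages + 1):
        items = fetch_page(page=page, per_page=per_page)
        if not items:
            break
        yield from items
        if len(items) < per_page:
            break  # a short page means this was the last one
```

Capping the number of pages is one way to address the concern about vague queries returning far more results than are useful.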

Backfill unit tests

Write unit tests for existing code.

RSS Source

  • run_respects_saved_state
  • run_does_preprocessing_deobfuscation
  • run_respects_feed_type
  • run_supports_both_content_summary
  • run_supports_both_link_url
  • run_returns_top_item_date_as_saved_state
  • run_returns_artifacts_correctly

Twitter Source

  • init_detects_search_type
  • run_respects_saved_state
  • run_returns_newest_tweet_id_as_saved_state
  • run_supports_all_endpoints
  • run_returns_artifacts_correctly

ThreatKB Operator

  • handle_domain_creates_domain
  • handle_ipaddress_creates_ipaddress
  • handle_yarasignature_creates_yarasignature

SQS Operator

  • handle_url_discards_ip_urls
  • handle_url_passes_kwargs

Config

  • daemon_returns_main_daemon_bool
  • sleep_returns_main_sleep_int
  • sources_returns_list_of_all_source_tuples
  • sources_excludes_internal_options_from_kwargs
  • operators_returns_list_of_all_operator_tuples
  • operators_excludes_internal_options_from_kwargs
  • save_state_writes_saved_state
  • get_state_reads_saved_state

Source base class

  • init_raises_not_implemented
  • run_raises_not_implemented
  • truncate_length_is_respected
  • content_is_preprocessed
  • urls_are_extracted
  • ips_are_extracted
  • yara_sigs_are_extracted
  • urls_matching_reference_link_are_discarded
  • nonobfuscated_urls_are_discarded
  • nonobfuscated_urls_are_included_if_specified

Operator base class

  • default_artifact_types_is_empty
  • handle_artifact_raises_not_implemented
  • process_includes_only_artifact_types

Ingestor class

  • init_creates_sources_operators_dicts
  • run_checks_config_daemon
  • run_once_calls_run_process_save_state

Add Task artifact type

Need a new artifact type for use with ThreatKB. Tasks would include things that can't be treated as a YARA rule or C2 hit - like manually investigating a CVE or an interesting blog post.

  • #12
  • Add Task support to ThreatKB library
  • Add Task artifact class
  • Add handle_task method to ThreatKB operator
  • Add Task as a supported artifact to applicable Sources

Required for #4

Figure out how to handle twitter links

There's not a good way right now to handle twitter t.co links vs defanged links. Decide on the best way to do this, either on the twitter end or via operator filters.

Clarify Twitter Credentials

The Twitter credentials in config.yml and the documentation could be clearer.

Instead of the current:

- name: twitter-auth
  # https://dev.twitter.com/oauth/overview/application-owner-access-tokens
  token:
  token_key:
  con_secret_key:
  con_secret:

the following would be clearer, as it uses the same terms and key order that Twitter presents:

- name: twitter-auth
  # https://dev.twitter.com/oauth/overview/application-owner-access-tokens
  api_key:
  api_key_secret:
  access_token:
  access_token_secret:

Add Sample artifact type

Add a new Sample artifact type to hold links to downloadable samples from app.any.run, hybrid-analysis, etc.

  • Add Sample artifact class
    • Extend from URL class
    • Fields: artifact will be the original extracted URL
    • Str method: modify the original url as needed to produce direct download URL
  • Extract Sample artifacts in sources.process_element
    • Include only nonobfuscated urls from the sample domain whitelist
  • Add support for new artifact to SQS and CSV operators

Handle malformed input errors better

There are some situations where unexpected input (like weirdly formed RSS content) causes the program to crash.

We should just do best-effort parsing in these cases. Unexpected input should never cause an uncaught exception; same with network errors/etc. If an input is bad, we can just log something and skip it.
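
In code, the log-and-skip approach could be sketched as a wrapper around whatever per-item parsing a source does. The names `process_all` and `process` are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def process_all(items, process):
    """Apply process() to each item, logging and skipping any that fail."""
    results = []
    for item in items:
        try:
            results.extend(process(item))
        except Exception:
            # Best effort: record the failure and move on to the next item.
            logger.exception("Skipping malformed input: %r", item)
    return results
```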

This is a requirement for initial release.

Add Hash artifact type

Extract md5/shasums as a new Artifact type.

  • Add Hash artifact class
    • Include a method for returning the hash type based on the length
  • Extract hashes in sources.process_element
  • Add support for the new artifact to CSV operator
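
Guessing the hash type from its length could be done as in this sketch. The function name is hypothetical, and the mapping covers only the common hex digest lengths.

```python
# Hex digest length -> hash algorithm name.
HASH_TYPES = {32: "md5", 40: "sha1", 64: "sha256", 128: "sha512"}


def hash_type(value):
    """Return the likely hash type for a hex string, or None if unknown."""
    value = value.strip().lower()
    if not value or any(char not in "0123456789abcdef" for char in value):
        return None
    return HASH_TYPES.get(len(value))
```

Length alone cannot distinguish algorithms that share a digest size, so the result is best treated as a hint.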

Improve documentation

  • Add/improve inline documentation
  • Generate sphinx docs
  • Add example workflows
  • Document "queue worker" workflow.
  • Document credential reuse.
  • Document artifact message interpolation.
  • Document abstract_json input/output.
  • Document 'extras'.
  • Document missing plugins / options.
  • Fill out installation instructions with troubleshooting for extra dependencies.

Fix URL "defanging" for encoded URLs

Currently, URLs that come from iocextract as hex-encoded, url-encoded, etc are not being "refanged" (decoded) correctly. Need to handle these properly.

Add Git source plugin

Add a Git source plugin, for pulling down repositories and checking the new files for YARA rules.

  • Write subprocess-git functions for the git commands we'll need
  • Use the hash of HEAD for the saved_state
  • Get new files with diff --name-only -- $saved_state
  • Any new files that match .{yar,yara,rule,rules}, run through the YARA regex, and extract individual rules
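
The steps above could be sketched with `subprocess` as follows. This is an illustration only: the helper names are hypothetical, and comparing the saved hash against HEAD with `git diff --name-only <saved_state> HEAD` is one reading of the diff command mentioned in the list.

```python
import subprocess
from pathlib import PurePosixPath

YARA_SUFFIXES = {".yar", ".yara", ".rule", ".rules"}


def git(repo, *args):
    """Run a git command inside repo and return its stripped stdout."""
    result = subprocess.run(
        ["git", "-C", repo, *args],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout.strip()


def head_hash(repo):
    """Return the hash of HEAD, suitable for use as saved_state."""
    return git(repo, "rev-parse", "HEAD")


def changed_files(repo, saved_state):
    """List files changed between the saved hash and the current HEAD."""
    return git(repo, "diff", "--name-only", saved_state, "HEAD").splitlines()


def yara_files(paths):
    """Keep only paths whose extension looks like a YARA rule file."""
    return [path for path in paths if PurePosixPath(path).suffix in YARA_SUFFIXES]
```

The files returned by `yara_files` would then be read and run through the YARA regex to extract individual rules.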

Add Twitter operator plugin

Needs some level of customization. Hashtags per-artifact-type, additional text/hashtag, reference link, etc. Ability to quote-tweet reference tweets.

Rework SQS source

Right now all it can do is read a link, fetch its contents, and extract IOCs. Need to extend it to accept more reasonable inputs. Keep in mind we'll be using this as a core piece of the pipeline models.

Base off of #23.

Publish ThreatKB library and add as a dependency

Publish the "threatkb" Python API as a library on PyPI, and add it as a dependency to requirements.txt here. This is a hard requirement for initial release. Be sure to update the threatkb library and the ThreatKB operator here to support API-key auth instead of user-password.
