inquest / iocextract Goto Github PK

Defanged Indicator of Compromise (IOC) Extractor.

Home Page: https://inquest.readthedocs.io/projects/iocextract/

License: GNU General Public License v2.0

Python 100.00%

ioc indicators-of-compromise library ioc-extractor defang threat-intelligence threat-sharing threatintel malware-research osint

iocextract's Introduction

iocextract

Indicator of Compromise (IOC) extractor for some of the most commonly ingested artifacts.

Overview
Installation
Usage
- Library
- Command Line Interface
Helpful Information

Overview

The iocextract package is a library and command line interface (CLI) for extracting URLs, IP addresses, MD5/SHA hashes, email addresses, and YARA rules from text corpora. It allows for you to extract encoded and "defanged" IOCs and optionally decode or refang them.

The Problem

It is common practice for malware analysts or endpoint software to "defang" IOCs such as URLs and IP addresses, in order to prevent accidental exposure to live malicious content. Being able to extract and aggregate these IOCs is often valuable for analysts. Unfortunately, existing "IOC extraction" tools often pass right by them, as they are not caught by standard regex.

For example, the simple defanging technique of surrounding periods with brackets:

127[.]0[.]0[.]1

Existing tools that use a simple IP address regex will ignore this IOC entirely.

Our Solution

By combining specially crafted regex with some custom post-processing, we are able to both detect and deobfuscate "defanged" IOCs. This saves time and effort for the analyst, who might otherwise have to manually find and convert IOCs into machine-readable format.

Example Use Case

Many Twitter users post C2s or other valuable IOC information with defanged URLs. For example, this tweet from @InQuest:

Recommended reading and great work from @unit42_intel:
https://researchcenter.paloaltonetworks.com/2018/02/unit42-sofacy-attacks-multiple-government-entities/ ...
InQuest customers have had detection for threats delivered from hotfixmsupload[.]com
since 6/3/2017 and cdnverify[.]net since 2/1/18.

If we run this through the extractor, we can easily pull out the URLs:

https://researchcenter.paloaltonetworks.com/2018/02/unit42-sofacy-attacks-multiple-government-entities/
hotfixmsupload[.]com
cdnverify[.]net

Passing in refang=True at extraction time would remove the obfuscation, but since these are real IOCs, let's leave them defanged in our documentation.

Installation

You may need to install the Python development headers in order to install the regex dependency. On Ubuntu/Debian-based systems, try:

sudo apt-get install python-dev

Then install iocextract from pip:

pip install iocextract

If you have problems installing on Windows, try installing regex directly by downloading the appropriate wheel from PyPI and installing via pip:

pip install regex-2018.06.21-cp27-none-win_amd64.whl

Usage

Library

Try extracting some defanged URLs:

import iocextract

content = \
"""
I really love example[.]com!
All the bots are on hxxp://example.com/bad/url these days.
C2: tcp://example[.]com:8989/bad
"""

for url in iocextract.extract_urls(content):
    print(url)

    # Output

    # hxxp://example.com/bad/url
    # tcp://example[.]com:8989/bad
    # example[.]com
    # tcp://example[.]com:8989/bad

NOTE: Some URLs may show up twice if they are caught by multiple regexes.

If you want, you can also "refang", or remove common obfuscation methods from IOCs:

import iocextract

for url in iocextract.extract_urls(content, refang=True):
    print(url)

    # Output

    # http://example.com/bad/url
    # http://example.com:8989/bad
    # http://example.com
    # http://example.com:8989/bad

If you don't want to defang the extracted IOCs at all during extraction, you can disable this as well:

import iocextract

content = \
"""
http://example.com/bad/url
http://example.com:8989/bad
http://example.com
http://example.com:8989/bad
"""

for url in iocextract.extract_urls(content, defang=False):
    print(url)

    # Output

    # http://example.com/bad/url
    # http://example.com:8989/bad
    # http://example.com
    # http://example.com:8989/bad

All extract_* functions in this library return iterators, not lists. The benefit of this behavior is that iocextract can process extremely large inputs, with a very low overhead. However, if for some reason you need to iterate over the IOCs more than once, you will have to save the results as a list:

import iocextract

content = \
"""
I really love example[.]com!
All the bots are on hxxp://example.com/bad/url these days.
C2: tcp://example[.]com:8989/bad
"""

print(list(iocextract.extract_urls(content)))
# ['hxxp://example.com/bad/url', 'tcp://example[.]com:8989/bad', 'example[.]com', 'tcp://example[.]com:8989/bad']

Command Line Interface

A command-line tool is also included:

$ iocextract -h
    usage: iocextract [-h] [--input INPUT] [--output OUTPUT] [--extract-emails]
                  [--extract-ips] [--extract-ipv4s] [--extract-ipv6s]
                  [--extract-urls] [--extract-yara-rules] [--extract-hashes]
                  [--custom-regex REGEX_FILE] [--refang] [--strip-urls]
                  [--wide]

    Advanced Indicator of Compromise (IOC) extractor. If no arguments are
    specified, the default behavior is to extract all IOCs.

    optional arguments:
      -h, --help            show this help message and exit
      --input INPUT         default: stdin
      --output OUTPUT       default: stdout
      --extract-emails
      --extract-ips
      --extract-ipv4s
      --extract-ipv6s
      --extract-urls
      --extract-yara-rules
      --extract-hashes
      --custom-regex REGEX_FILE file with custom regex strings, one per line, with one capture group each
      --refang              default: no
      --strip-urls          remove possible garbage from the end of urls. default: no
      --wide                preprocess input to allow wide-encoded character matches. default: no

NOTE: Only URLs, emails, and IPv4 addresses can be "refanged".

Helpful Information

FAQ

Are you...

Q. Extracting possibly-defanged IOCs from plain text, like the contents of tweets or blog posts?

A. Yes! This is exactly what iocextract was designed for, and where it performs best. Want to go a step farther and automate extraction and storage? Check out ThreatIngestor.

Q. Extracting URLs that have been hex or base64 encoded?

A. Yes, but the CLI might not give you the best results. Try writing a Python script and calling iocextract.extract_encoded_urls directly.

Note: You will most likely end up with extra garbage at the end of URLs.

Q. Extracting IOCs that have not been defanged, from HTML/XML/RTF?

A. Maybe, but you should consider using the --strip-urls CLI flag (or the strip=True parameter in the library), and you may still get some extra garbage in your output. If you're extracting from HTML, consider using something like Beautiful Soup to first isolate the text content, and then pass that to iocextract, like this.

Q. Extracting IOCs that have not been defanged, from binary data like executables, or very large inputs?

A. There is a very simplistic version of this available when running as a library, but it requires the defang=False parameter and could potentially miss some of the IOCs. The regex in iocextract is designed to be flexible to catch defanged IOCs. If you're unable to collect the information you need, consider using something like Cacador instead.

More Details

This library currently supports the following IOCs:

IP Addresses
- IPv4 fully supported
- IPv6 partially supported
URLs
- With protocol specifier: http, https, tcp, udp, ftp, sftp, ftps
- With [.] anchor, even with no protocol specifier
- IPv4 and IPv6 (RFC2732) URLs are supported
- Hex-encoded URLs with protocol specifier: http, https, ftp
- URL-encoded URLs with protocol specifier: http, https, ftp, ftps, sftp
- Base64-encoded URLs with protocol specifier: http, https, ftp
Emails
- Partially supported, anchoring on @ or at
YARA rules
- With imports, includes, and comments
Hashes
- MD5
- SHA1
- SHA256
- SHA512
Telephone numbers
Custom regex
- With exactly one capture group

For IPv4 addresses, the following defang techniques are supported:

Technique	Defanged	Refanged
`.` -> `[.]`	1[.]1[.]1[.]1	1.1.1.1
`.` -> `(.)`	1(.)1(.)1(.)1	1.1.1.1
`.` -> `\.`	1\.1\.1\.1	1.1.1.1
Partial	1[.1[.1.]1	1.1.1.1
Any combination	1.)1[.1.)1	1.1.1.1

For email addresses, the following defang techniques are supported:

Technique	Defanged	Refanged
`.` -> `[.]`	me@example[.]com	[email protected]
`.` -> `(.)`	me@example(.)com	[email protected]
`.` -> `{.}`	me@example{.}com	[email protected]
`.` -> `_dot_`	me@example dot com	[email protected]
`@` -> `[@]`	me[@]example.com	[email protected]
`@` -> `(@)`	me(@)example.com	[email protected]
`@` -> `{@}`	me{@}example.com	[email protected]
`@` -> `_at_`	me at example.com	[email protected]
Partial	me@} example[.com	[email protected]
Added spaces	me@example [.] com	[email protected]
Any combination	me @example [.)com	[email protected]

For URLs, the following defang techniques are supported:

Technique	Defanged	Refanged
`.` -> `[.]`	`example[.]com/path`	`http://example.com/path`
`.` -> `(.)`	`example(.)com/path`	`http://example.com/path`
`.` -> `\.`	`example\.com/path`	`http://example.com/path`
Partial	`http://example[.com/path`	`http://example.com/path`
`/` -> `[/]`	`http://example.com[/]path`	`http://example.com/path`
Cisco ESA	`http:// example .com /path`	`http://example.com/path`
`://` -> `__`	`http__example.com/path`	`http://example.com/path`
`://` -> `:\\`	`http:\\example.com/path`	`http://example.com/path`
`:` -> `[:]`	`http[:]//example.com/path`	`http://example.com/path`
`hxxp`	`hxxp://example.com/path`	`http://example.com/path`
Any combination	`hxxp__ example( .com[/]path`	`http://example.com/path`
Hex encoded	`687474703a2f2f6578616d706c652e636f6d2f70617468`	`http://example.com/path`
URL encoded	`http%3A%2F%2fexample%2Ecom%2Fpath`	`http://example.com/path`
Base64 encoded	`aHR0cDovL2V4YW1wbGUuY29tL3BhdGgK`	`http://example.com/path`

NOTE: The tables above are not exhaustive, and other URL/defang patterns may also be extracted correctly. If you notice something missing or not working correctly, feel free to let us know via the GitHub Issues.

The base64 regex was generated with @deadpixi's base64 regex tool.

Custom Regex

If you'd like to use the CLI to extract IOCs using your own custom regex, create a plain text file with one regex string per line, and pass it in with the --custom-regex flag. Be sure each regex string includes exactly one capture group.

For example:

http://(example\.com)/
(?:https|ftp)://(example\.com)/

This custom regex file will extrac the domain example.com from matching URLs. The (?: ) noncapture group won't be included in matches.

If you would like to extract the entire match, just put parentheses around your entire regex string, like this:

(https?://.*?.com)

If your regex is invalid, you'll see an error message like this:

Error in custom regex: missing ) at position 5

If your regex does not include a capture group, you'll see an error message like this:

Error in custom regex: no such group

Always use a single capture group when working with custom regex. Here's a quick example:

[
    r'(my regex)',  # This yields 'my regex' if the pattern matches
    r'my (re)gex',  # This yields 're' if the pattern matches
]

Using more than a single capture group can cause unexpected results. Check out this example:

[
    r'my regex',  # This doesn't yield anything
    r'(my) (re)gex',  # This yields 'my' if the pattern matches
]

Why? Because the result will always yield only the first group match from each regex.

For more complicated regex queries, you can combine capture and non-capture groups like so:

[
    r'(?:my|your) (re)gex',  # This yields 're' if the pattern matches
]

You can now compare the (?: ) syntax for noncapture groups vs the ( ) syntax for the capture group.

Related Projects

If iocextract doesn't fit your use case, several similar projects exist. Check out the defang and indicators-of-compromise tags on GitHub, as well as:

Cacador in Go
ioc-extractor in JavaScript
Cyobstract in Python

If you'd like to automate IOC extraction, enrichment, export, and more, check out ThreatIngestor.

If you're working with YARA rules, you may be interested in plyara.

Contributing

If you have a defang technique that doesn't make it through the extractor, or if you find any bugs, Pull Requests and Issues are always welcome. The library is released under a GPL-2.0 license.

Who's using iocextract?

Are you using it? Want to see your site listed here? Let us know!

iocextract's People

Contributors

Stargazers

Watchers

Forkers

deadbits ralex1975 tamieem j-charles7 4n6strider krispimk threatinteltest reanimat0r culina gregcopenhaver ninoseki re0x90 asrabon ktp-forked-repos pmichaudii raystyle rshipp materaj2 techlord-rce seth1002 heikipikker darkfinder rahmiy idkwim averroes rajivraj huseyinrencber acharyarohan anarquias 8589 moriweiji netsec vuln awesome-security larrycameron80 binarybytes polidorino merenguelkl atdanny alirezachegini gwg422 anylayer netwrkspider imuledx youngjun-chang 5l1v3r1 kp-forks supernothing crackercat imendax idev yogeshlc bdllerena shadowscatcher sky067 zlzhangv007 alexjnorton fzxcp3 presianbg dsfinn sh9369 wildactual osinter-project onesorzer0es bkaathi theaj42 litchi125 alchemycyberblaze nanda-rani wilsonswee long80a data-gami hartl3y94 fiverest ikpehlivan dav-dlx dk26 sima456 iq-scm ravyn13 zawadidone teddymwai ethicalsecurity-agency abdulsattar-arctic professorkilo morw1nd n4n4r3

iocextract's Issues

URL path defang and Email extraction

I noticed that if the URL was something like this: hxxps://momorfheinz[.]usa[.]cc/login[.]microsoftonline[.]com then it would only defang that it only fixed the netloc portion of the URL. Also, made a change to the email regex.

What do you think?

diff --git a/iocextract.py b/iocextract.py
index 814ad8a..fc2d80b 100644
--- a/iocextract.py
+++ b/iocextract.py
@@ -124,7 +124,7 @@ IPV6_RE = re.compile(r"""
         \b(?:[a-f0-9]{1,4}:|:){2,7}(?:[a-f0-9]{1,4}|:)\b
     """, re.IGNORECASE | re.VERBOSE)
 
-EMAIL_RE = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)")
+EMAIL_RE = re.compile(r"([a-zA-Z0-9_.+-]+[\s]*@[\s]*[a-zA-Z0-9-]+[[]*\.[]]*[a-zA-Z0-9-.]+)")
 MD5_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{32})(?:[^a-fA-F\d]|\b)")
 SHA1_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{40})(?:[^a-fA-F\d]|\b)")
 SHA256_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{64})(?:[^a-fA-F\d]|\b)")
@@ -247,7 +247,7 @@ def extract_emails(data):
     :rtype: Iterator[:class:`str`]
     """
     for email in EMAIL_RE.finditer(data):
-        yield email.group(0)
+        yield email.group(0).replace(" ", "").replace("[.]", ".")
 
 def extract_hashes(data):
     """Extract MD5/SHA hashes.
@@ -420,6 +420,7 @@ def refang_url(url):
     # Fix example[.]com, but keep RFC 2732 URLs intact.
     if not _is_ipv6_url(url):
         parsed = parsed._replace(netloc=parsed.netloc.replace('[', '').replace(']', ''))
+        parsed = parsed._replace(path=parsed.path.replace('[.]', '.'))
 
     return parsed.geturl()

Add CLI usage examples to README

Review documentation

Need to review the documentation and verify it's still up to date. Also, it appears to be failing in certain sections.

Add wide-character support to all extractions

Email Obfuscation Edit

Identify and 'refang' emails formatted as follows:
identifier[@]domain[.com]

Add support for base64-encoded URLs

Defanged first.last Email Addresses do not refang correctly

Issue seems to be around this defangged format:

firstname[.]lastname[@]domainname[.]org

When refanged, seeing the following:

[email protected]

For some reason, the username format of first.last is getting chopped off to just last.

refang_url converts unknown schemes (such as 'tcp') to 'http'

It seems that refang'ing urls with a scheme not listed in line: https://github.com/InQuest/python-iocextract/blob/4da913206d8e94a6a3b137c011c89e9707cb3966/iocextract.py#L626
replaces it with 'http': https://github.com/InQuest/python-iocextract/blob/4da913206d8e94a6a3b137c011c89e9707cb3966/iocextract.py#L631.

Maybe a hard-coded conversion mapping could be used, e.g.:

refang_schemes = {
    'http': ['hxxp'],
    'https': ['hxxps'],
    'ftp': ['ftx', 'fxp'],
    'ftps': ['ftxs', 'fxps']
}
for scheme, fanged in refang_schemes.items():
    if parsed.scheme in fanged:
        parsed = parsed._replace(scheme=scheme)
        url = parsed.geturl().replace(scheme + ':///', scheme + '://')

        try:
            _ = urlparse(url)
        except ValueError:
            # Last resort on ipv6 fail.
            url = url.replace('[', '').replace(']', '')

        parsed = urlparse(url)

        break

This is not as catch-all as the current solution, but on the other hand it does not alter the indicator.

Example:

In [1]: import iocextract                                                                              

In [2]: content = """tcp://example[.]com:8989/bad"""                                                   

In [3]: list(iocextract.extract_urls(content))                                                         
Out[3]: ['tcp://example[.]com:8989/bad', 'tcp://example[.]com:8989/bad']

In [4]: list(iocextract.extract_urls(content, refang=True))                                            
Out[4]: ['http://example.com:8989/bad', 'http://example.com:8989/bad']

Note: This behavior is shown in the output examples in the README.rst in the 'Usage' section related to refang.

Improve extraction for non-defanged URLs

"while it seems like the bug originally referenced in this issue is fixed in the new version, the one I commented above still exists. Defanged IPs still get extracted by extract_urls while their non-defanged counterparts don't"

Issue comment: #34 (comment)

Improve IPv6 extraction

Things that look like timestamps, and things like 1:6:0, are getting through. If we can't improve the regex to catch these, maybe add a filter on the iterator?

Handle extraction from all files in a directory

It'd be great to be able to provide a directory path to iocextract and have it iterate over all files, extracting IOC's from each as it goes.

for example, i have a directory of malicious SLK files and I want to quickly dump all the URLs. right now I have to use something like for i in ls; do iocextract --extract-urls --input $i; done

passing a dir to --input obviously throws an exception due to the arguments use to io:

 File "iocextract.py", line 442, in <lambda>
    parser.add_argument('--input', type=lambda x: io.open(x, 'r', encoding='utf-8', errors='ignore'),
IOError: [Errno 21] Is a directory: '/home/adam/research/malware/campaigns/slk-droppers'

Would you be okay with re-working --input to accept a file as input, stdin as an optional positional argument, and add a --dir argument for folders? I can put in a PR if so - or if you have any other suggestions for this use case, that'd be great :D

'https' scheme values defanged as HXXPS are refanged as 'http'

As in title; e.g.

'hxxps://example.com' is refanged as 'http://example.com'

Add extraction for hex-encoded URLs

In progress...

Refang excepts in certain cases

We do the urlparse try/except before modifiying the URL, which may cause it to error out after we prepend the scheme. Need to just move all the url modifications before the urlparse test.

base64 strings

Hey,

I was looking to use this for decoding some base64 strings inside json and it did not see to find the following when using refang.

  },
      "data": {
        ".dockerconfigjson": "ewoJImF1dGhzIjogewoJCSJjZGUtZG9ja2VyLXJlZ2lzdHJ5LmVpYy5mdWxsc3RyZWFtLmFpIjogewoJCQkiYXV0aCI6ICJZMlJsTFhKbFoybHpkSEo1T21Oa1pTMXlaV2RwYzNSeWVRPT0iCgkJfQoJfQp9"
      },

Any way to improve this at all?

Refactor regex with re.X

https://docs.python.org/3/library/re.html#re.X

Will improve readability and maintainability of the regexes.

PyPi License Mismatch

Hey, just letting you know that in PyPi your package is listed as BSD. This is likely due to your configuration in setup.py classifiers. Cheers!

Add the function --extract-domains and --extract-subdomains

Sometimes it is necessary to simply extract the domains and or the domains and subdomains.

And a question, are the new longer domain extensions included?

Extract domain names without URI scheme

I was trying to pull out a list of domains from a text file input (sample of input / expected output below), but iocextract doesn't recognize anything without a URI scheme I think.

Is it possible to include an --extract-domains, or have --extract-urls optionally ignore the scheme for instance? Just random thoughts, not sure the best way to handle this given how complicated the regex is.

If it's any help, this pattern ([a-zA-Z0-9-_]+(\.)+)?([a-z0-9-_]+)*\.+[a-z]{2,63} should match pretty much any domain name up to the TLD.

matches:

google.com
foo.mywebsite.io
hack-the-planet.com
asdf-fdsa.foo-bar.com
foo-bar.domain.name.com

Sample Input

GLOBAL
Pool    Location    Total Fee/Donations Hashrate    Miners  Link
supportXMR.com
PPLNS exchange payout custom threshold workerIDs email monitoring SSL Android APP   DE,FR,US,CA,SG  0.6 %   86.79 MH/s  7228
xmrpool.net
PPS PPLNS SOLO exchange payout custom threshold workerIDs email monitoring SSL  USA/EU/Asia 0.4-0.6 %   642.32 KH/s 179
xmr.nanopool.org
PPLNS exchange payout workerIDs email monitoring SSL    USA/EU/Asia 1 % 105.52 MH   6155
 minergate.com
possible share skimming! People complaining about poor hashrate.
RBPPS PPLNS USA/EU  1-1.5 % 26.50 MH/s  37467
viaxmr.com
PPLNS exchange payout custom threshold workerIDs email monitoring SSL   US/UK/AU/JP 0.4 %    API problem     API problem
monero.hashvault.pro

Was hoping to get output of:

supportXMR.com
xmrpool.net
monero.hashvault.pro
minergate.com

URL is not extracted correctly

When I ran the sample script with one line of text, all text was displayed without extracting URLs.

import iocextract

content = \
"""
All the bots are on hxxp://example.com/bad/url these days.
"""

for url in iocextract.extract_urls(content):
    print(url)

The output result is as follows.

$ python3 test.py

All the bots are on hxxp://example.com/bad/url these days.

BUG: --extract-ipv4s does not work

Unfortunately it doesn't work, I ran it for quite a while but except for stressing one CPU core 100% nothing happened, the IPs were not written to the file.
iocextract --input '/home/user/des.txt' --output '/home/user/k1.txt' --extract-ipv4s

how do I add a ioc_type label with the output?

This is probably more of a feature request...
Is there a way with the "extract_iocs" function to have it output the IOC Type next to the IOC?

I have a work around, but I have to call each function individually.

import iocextract
import pandas as pd
hashes = pd.DataFrame(iocextract.extract_sha256_hashes(glob), columns=['ioc'])
hashes['ioc_type'] = "sha256_hash"
hashes

Extracting URLs that have been base64 encoded

Currently, it seems like iocextract extracts only the first URL found in a base64 encoded string.

For example for the following string (original):
'https://google.com https://amazon.com https://microsoft.com http://google.com http://amazon.com http://microsoft.com'
the base64 encoded string is: 'aHR0cHM6Ly9nb29nbGUuY29tIGh0dHBzOi8vYW1hem9uLmNvbSBodHRwczovL21pY3Jvc29mdC5jb20gaHR0cDovL2dvb2dsZS5jb20gaHR0cDovL2FtYXpvbi5jb20gaHR0cDovL21pY3Jvc29mdC5jb20g'
and only the first found URL is returned.

If I change the sequence of the URLs in the original string and then encode it with base 64, iocextract will return the URL that occurs first this time.

Can you please fix this and return all the URLs existing in a base64 encoded string?

Look in to adding \. defang detection

Example: https://twitter.com/ClearskySec/status/1001833343581900800

c2: www.nubpubwizard.jetos\.com
c2: worktrs.wikaba\.com
Spoofs host header

Exception with some unicode in URLs

Traceback (most recent call last):
  File "iocextract", line 11, in <module>
    sys.exit(main())
  File "local/lib/python2.7/site-packages/iocextract.py", line 433, in main
    for ioc in extract_urls(args.input.read(), refang=args.refang, strip=args.strip_urls):
  File "local/lib/python2.7/site-packages/iocextract.py", line 155, in extract_urls
    url = refang_url(url.group(1))
  File "local/lib/python2.7/site-packages/iocextract.py", line 395, in refang_url
    return parsed.geturl()
  File "/usr/lib64/python2.7/urlparse.py", line 134, in geturl
    return urlunparse(self)
  File "/usr/lib64/python2.7/urlparse.py", line 231, in urlunparse
    return urlunsplit((scheme, netloc, url, query, fragment))
  File "/usr/lib64/python2.7/urlparse.py", line 242, in urlunsplit
    url = '//' + (netloc or '') + url
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 17: ordinal not in range(128)

Example url:

https://secure.comodo.net/CPS0C��U���<0:08�6�4�2http://crl.comodoca.com/COMODORSACodeSigningCA.crl0t+�����h0f0>+��0��2http://crt.comodoca.com/COMODORSACodeSigningCA.crt0$+��0���http://ocsp.comodoca.com0���U����0���[email protected]

URLs pulling in IPs

If I have a URL with a port - e.g. 1.1.1.1:449 I'm seeing a URL getting extracted in the format of:
http://1.1.1.1:449.

Is that desired behavior?

Add IP v4/v6 flags

Add two new flags:

--extract-ipv4s
--extract-ipv6s

File redirection doesn't work

If I run iocextract.py --input info.txt it will correctly print indicators to what seems to be standard out, however iocextract.py --input info.txt | less simply gives the the "you've got nothing END" in less. It looks like however you're getting the handle to STDOUT isn't the actual STDOUT handle.

Tested on OS X 10.14.6 with Python 3.7.6.

SHA1 extracts

It appears that the extract for SHA1 only pulls the first 32 characters so it looks like a MD5 hash.

Can't decode url throw an error

Traceback (most recent call last):
  File "extract.py", line 18, in <module>
    for i in iocextract.extract_encoded_urls(f.read(), refang=True):
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 174: invalid start byte

I create an simple python script to find url on current directory with iocextract, but throw an error when using extract_encoded_urls

Subdomains and IPs in URLs are not always parsed correctly

Given defanged URLs with an IP address or a subdomain such as:

hXXps://192.168.149[.]100/api/info
hXXps://subdomain.example[.]com/some/path

The GENERIC_URL_RE regex returns the correct results. However, since they are also parsed with the BRACKET_URL_RE regex additional invalid results are also returned:

http://149.100/api/info
http://example.com/some/path

A simple change seems to fix the problem--assuming I'm not missing some false positive scenario.

diff --git a/iocextract.py b/iocextract.py
index 8fdb374..dcd25dd 100644
--- a/iocextract.py
+++ b/iocextract.py
@@ -66,7 +66,7 @@ GENERIC_URL_RE = re.compile(r"""
 BRACKET_URL_RE = re.compile(r"""
         \b
         (
-            [\:\/\\\w\[\]\(\)-]+
+            [\.\:\/\\\w\[\]\(\)-]+
             (?:
                 \x20?
                 [\(\[]

catastrophic backtracking in BACKSLASH_URL_RE

Pretty much the title, discovered this in a downstream project, https://github.com/s0md3v/Photon, commented on it there as well. Thought I'd leave the comment here too, the rest of the defang RE seem to work fine, but the backslash one seemed to cause a lot of hangs when I was using it.

Test against: http://myexample.com/dir/../path/escaping/../too/many/../dots/../in/../the/path/../cause/this/to/fail

ModuleNotFoundError: No module named 'iocextract'

I installed it in Arch Linux, unfortunately I only get an error message.

Steps:

sudo pipx install iocextract --force
iocextract -h

$ /usr/bin/iocextract -h
Traceback (most recent call last):
  File "/usr/bin/iocextract", line 5, in <module>
    from iocextract import main
ModuleNotFoundError: No module named 'iocextract'

Fails to parse this url correctly

The url is:
https://www.mysite.com/endpoint?param=abc--~C<http://anothersite.com/myfile.zip>

the trailing > is always stripped off the url even through it is part of it. When I extract_iocs I get:
https://www.mysite.com/endpoint?param=abc--~C<http://anothersite.com/myfile.zip

I can give the real url that I discovered this issue with, but it is malicious so I didn't want to include it here.

Add defang function

Add a defang function that accepts a normal URL/domain/IP and returns a defanged version.

Example input/output:

Input	Output
http://example.com/path.ext	hxxp://example[.]com/path.ext
http://example.com/	hxxp://example[.]com/
example.com	example[.]com
127.0.0.1	127[.]0[.]0[.]1

I need this for ThreatIngestor, makes the most sense to include it here.

Various URL extraction issues

Hold-all issue for invalid URLs I find that come through extraction.

http:// NOTICE
https://redacted.sf-api.eu/</BaseUrl
https://ln.sync[.]com/dl/f6772eb20/d8yt6kez-9q7eef3m-ai27ebms-8zcufi5f (Please
http://as rsafinderfirewall[.]com/Es3tC0deR3name.exe):
http://domain rsafinderfirewall[.]com
http://example,\xa0c0pywins.is-not-certified[.]com
webClient.DownloadString(‘https://a.pomf[.]cat/ntluca.txt
http://HtTP:\\193[.]29[.]187[.]49\qb.doc\u201d
http://tintuc[.]vietbaotinmoi[.]com\u201d
espn[.]com.\u201d
http://calendarortodox[.]ro/serstalkerskysbox.png”
tFtp://cFa.tFrFa
h\u2013p://dl[.]dropboxusercontent[.]com/s/rlqrbc1211quanl/accountinvoice.htm
hxxp://paclficinsight.com\xa0POST /new1/pony/gate.php
http://at\xa0redirect.turself-josented[.]com
KDFB.DownloadFile('hxxps://authenticrecordsonline[.]com/costman/dropcome.exe',
at\xa0hxxp://paclficinsight[.]com/new1/pony/china.jpg
hxxp://<redacted>/28022018/pz.zip.\xa0
hxxp:// 23.89.158.69/gtop
h00p://bigdeal.my/gH9BUAPd/js.js"\uff1e\uff1c/script\uff1e
hxxp://smilelikeyoumeanit2018[.]com[.]br/contact-server/,
hxxp:// feeds.rapidfeeds[.]com/88604/
hxxp://www.xxx.xxx.xxx.gr/1.txt\u2019
h00p://119
h00p://218.84
hxxp:// "www.hongcherng.com"/rd/rd
http://http%3a%2f%2f117%2e18%2e232%2e200%2f
http://http%3a%2f%2fgaytoday%2ecom%2f
h00p://http://turbonacho(.)com/ocsr.html"\uff1e

URLs with wildcard/regex:

https://.+\.unionbank\.com/
https://.*citizensbank\.com/
https://(www\.|)svbconnect\.com/
https://(bolb\-(west|east)|www)\.associatedbank\.com/

Extracts part of the match as a second URL:

i[.]memenet[.]org/wfedgl[.]hta -> wfedgl[.]hta
http://196.29.164.27/ntc/ntcblock.html?dpid=1&dpruleid=3&cat=10&ttl=-200&groupname=Canar_staff&policyname=canar_staff_policy&username=[REDACTED]&userip=[REDACTED]&connectionip=127.0.0.1&nsphostname=NSPS01&protocol=policyprocessor&dplanguage=-&url=http%3a%2f%2fwww%2emonacogoldcasino%2ecom%2f” -> http%3a%2f%2fwww%2emonacogoldcasino%2ecom%2f

extract_unencoded_url is too greedy when parsing Windows command lines

I'm parsing input containing examples of PowerShell or cmd.exe command lines. When a command flag with a slash comes after an URL, then the flag is included in the extracted URL.

Here is an example:

list(iocextract.extract_unencoded_urls("command.exe https://pypi.org/project/iocextract/ /f"))
  # => ['https://pypi.org/project/iocextract/ /f']

The trailing /f should not be included in the extracted URL.

Failed to parse URL correctly

A URL which is surrounded by Japanese characters is not parsed correctly.

print(list(iocextract.extract_urls('『http://example.com』あああああ')))
# => ['http://example.com』あああああ']

# My expectation is ['http://example.com']

I'm not sure how to fix it. But I think checking TLD might work well.

IPv4 extraction doesn't recognize netstat command input

iocextract doesn't seem to recognize any IPv4 addresses from netstat output since they all end with .<port number> or the protocol. For example, 10.1.1.117.4222 and 10.1.1.117.https.
It pulls out IPv6 adddresses just fine, though.

This would be a super useful addition to have when triaging host events from an DFIR standpoint :)

Any suggested work around or is there a possible patch that would cover this?

Improve documentation

Add support for custom regex

Add a function that takes a list of regex strings as input, compiles them, and runs them against a data input, yielding results.
Add a flag to the CLI that takes a file and reads out regex into a list, then passes it to the above function and prints results.

IPv4 extract should be looser

This doesn't get extracted: 78.128.76.]]]165.

Improve YARA regex

Improve YARA regex to correctly extract things outside a standard rule { } format. This should include:

includes before a rule
imports before a rule
tags: https://yara.readthedocs.io/en/v3.8.1/writingrules.html#rule-tags
scopes (private and global): https://yara.readthedocs.io/en/v3.8.1/writingrules.html#global-rules

Related context: plyara/plyara#53.

Add a function to import directly from a server and extract IOCs.

Example iocextract --input 'https://toast.home.us/random' --output '/home/user/k1.txt' --extract-ipv4s

Found IPs being parsed as URLs

Hey! Currently working with iocextract to read from a text file and convert to a query. I just now ran in the issue where the IPs were being extracted as IPs but then they were also being extracted and formatted as URLs.
Input: 101.28[.]225[.]248 ---> Output: RemoteIP =~ "101.28.225.248" or RemoteUrl has "http://101.28.225.248"

module 'iocextract' has no attribute 'refang_url'

There is a problem with the latest version of the source code on pypi

Binary Extraction

Looking at how I might use something like this to pull indicators directly from malware binaries. Wondering if something like this could essentially run strings and extract ioc. Would also be nice to use this as a python library.

URL bracket regex is too loose

CDATA[^h00ps://online\(.)americanexpress\(.)com/myca/.*?request_type=authreg_acctAccountSummary]]＞

Should stop at the first character not in [\w-\[\]] when looking backwards. In this case the ^.

Even tighter, we can stop at the first character not in [\w] if it's before a ://.

URL generic regex is too loose

Require at least one [\w] character immediately following the ://. Exceptions for [\s\[\(]* to catch defangs.

Failed to extract the URLs from this tweet

The URLs from this tweet were not extracted by the tool.

I tried all the methods : extract_urls, extract_unencoded_urls, extract_encoded_urls , none of them worked. Is there a way to fix it ?