
lookyloo / lookyloo


Lookyloo is a web interface that allows users to capture a website page and then display a tree of domains that call each other.

Home Page: https://www.lookyloo.eu

License: Other

Python 69.07% CSS 0.99% JavaScript 6.07% HTML 23.69% Dockerfile 0.14% Shell 0.04%
information-security privacy web-security dfir capture scraping lookyloo

lookyloo's Introduction

Lookyloo icon

Lookyloo is a web interface that captures a webpage and then displays a tree of the domains that call each other.

Gitter

What's in a name?!

Lookyloo ...

Same as Looky Lou; often spelled as Looky-loo (hyphen) or lookylou

1. A person who just comes to look.
2. A person who goes out of the way to look at people or something, often causing crowds and disruption.
3. A person who enjoys watching other people's misfortune. Oftentimes car onlookers that stare at car accidents.

In L.A., usually the lookyloos cause more accidents by not paying full attention to what is ahead of them.

Source: Urban Dictionary

No, really, what is Lookyloo?

Lookyloo is a web interface that allows you to capture and map the journey of a website page.

Find all you need to know about Lookyloo on our documentation website.

Here's an example of a Lookyloo capture of the site github.com:

Screenshot of Lookyloo capturing GitHub

REST API

The API is self-documented with Swagger. You can play with it on the demo instance.

Installation

Please refer to the install guide.

Python client

pylookyloo is the recommended client to interact with a Lookyloo instance.

It is available on PyPI, so you can install it using the following command:

pip install pylookyloo

For more details on pylookyloo, read the overview docs, the documentation of the module itself, or the code in this GitHub repository.

Notes regarding using S3FS for storage

Directory listing

TL;DR: it is slow.

If you have many captures (say, more than 1000/day) and store them in an S3 bucket mounted with s3fs-fuse, doing a directory listing in bash (ls) will most probably lock I/O for every process trying to access any file in the whole bucket. The same is true if you access the filesystem using Python methods (iterdir, scandir, ...).

A workaround is to use the Python s3fs module, as it talks to the S3 API directly and does not go through the mounted filesystem to list directories. You can configure the s3fs credentials in config/generic.json under the s3fs key.

Warning: this will not save you if you run ls on a directory that contains a lot of captures.

Versioning

By default, a MinIO bucket (the backend for s3fs) has versioning enabled, which means it keeps a copy of every version of every file you store. This becomes a problem if you have a lot of captures: the index files are updated on every change, and the maximum number of versions is 10,000. So by the time you have more than 10,000 captures in a directory, you'll get I/O errors when you try to update the index file. And you absolutely do not care about that versioning in Lookyloo.

To check if versioning is enabled (can be either enabled or suspended):

mc version info <alias_in_config>/<bucket>

The command below will suspend versioning:

mc version suspend <alias_in_config>/<bucket>

I'm stuck, my file is raising I/O errors

This happens when your index was updated more than 10,000 times while versioning was enabled.

This is how to check whether you're in this situation:

  • Error message from bash (unhelpful):
$ (git::main) rm /path/to/lookyloo/archived_captures/Year/Month/Day/index
rm: cannot remove '/path/to/lookyloo/archived_captures/Year/Month/Day/index': Input/output error
  • Check with python
from lookyloo.default import get_config
import s3fs

# Load the s3fs credentials from config/generic.json (key "s3fs")
s3fs_config = get_config('generic', 's3fs')
s3fs_client = s3fs.S3FileSystem(key=s3fs_config['config']['key'],
                                secret=s3fs_config['config']['secret'],
                                endpoint_url=s3fs_config['config']['endpoint_url'])

s3fs_bucket = s3fs_config['config']['bucket_name']
# Delete the index directly through the S3 API, bypassing the FUSE mount
s3fs_client.rm_file(s3fs_bucket + '/Year/Month/Day/index')
  • Error from python (somewhat more helpful):
OSError: [Errno 5] An error occurred (MaxVersionsExceeded) when calling the DeleteObject operation: You've exceeded the limit on the number of versions you can create on this object
  • Solution: run this command to remove all older versions of the file
mc rm --non-current --versions --recursive --force <alias_in_config>/<bucket>/Year/Month/Day/index

Contributing to Lookyloo

To learn more about contributing to Lookyloo, see our contributor guide.

Code of Conduct

At Lookyloo, we pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community. You can access our Code of Conduct here or on the Lookyloo docs site.

Support

  • To engage with the Lookyloo community contact us on Gitter.
  • Let us know how we can improve Lookyloo by opening an issue.
  • Follow us on Twitter.

Security

To report vulnerabilities, see our Security Policy.

Credits

Thank you very much Tech Blog @ willshouse.com for the up-to-date list of UserAgents.

License

See our LICENSE.

lookyloo's People

Contributors

adrima01, adulau, antoniabk, arhamyss, buildbricks, cudeso, dependabot[bot], docarmorytech, dssecret, fafnerkeyzee, felalex57, numbuh474, of-cag, rafiot, steveclement, sw-mschaefer, th4nat0s, vmdhhh


lookyloo's Issues

SVG interactions

Main hostname tree:

  • click on icon (i.e. JS) -> displays box with all URLs loading a JS
  • click on hostname -> display all the related URLs (same format as hostnames: line 1: URL, Line 2: icons)

Overlay box:

  • click on icon (i.e. JS) -> download the content

CSV export

Will be nice ( yes again )....

To have the capacity to export the data: JSON is an option, but most of the time CSV is the format usable by most people.

HIT, Called by, [type... javascript, cookie, etc..]

voilà :)
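A minimal sketch of what such an export could look like, using Python's standard csv module. The column names (hit, called_by, resource_type) follow the suggestion above and are assumptions, not Lookyloo's actual data model.

```python
import csv
import io

def to_csv(nodes):
    """Render a list of node dicts as CSV text (hypothetical schema)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=['hit', 'called_by', 'resource_type'])
    writer.writeheader()
    for node in nodes:
        writer.writerow(node)
    return buf.getvalue()

# Hypothetical capture rows, following the HIT / Called by / type columns
nodes = [
    {'hit': 'https://example.com/a.js', 'called_by': 'https://example.com/', 'resource_type': 'javascript'},
    {'hit': 'https://example.com/c', 'called_by': 'https://example.com/', 'resource_type': 'cookie'},
]
result = to_csv(nodes)
print(result)
```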

Collapse/expand tree/pop up window ambiguity

Expand/collapse tree currently links to windows, but the text controls the pop-up window. Put the text and the tree circle on the same horizontal rule, give them both a similar border, and drop the inheritance line from between them (or possibly from the right-hand side of the new border?).

Option to disable or rename session cookies

LookyLoo sets a session cookie (boringly named session). This is an issue if LookyLoo is being used behind a reverse proxy with an access authorization system that also happens to set a cookie named session -- the effect is that:

  1. request comes to the reverse proxy; reverse proxy does its magic and sets its session cookie to persist the authorization status;
  2. request is sent further to the upstream (i.e. LookyLoo).
  3. LookyLoo sets its own session cookie, since the one set by the reverse proxy does not conform to whatever LookyLoo expects
  4. response is returned to the client -- with the LookyLoo session cookie overwriting the reverse proxy cookie
  5. upon the next request, the whole dance starts over

This results in no session persistence and LookyLoo not working properly behind such a reverse proxy. It would be swell if it were possible to change the name of the session cookie set by LookyLoo so as not to clash with a potential reverse proxy cookie.

The cookie seems not necessary -- blocking Set-Cookie on the reverse proxy (so that it does not reach the browser) does not seem to result in loss of functionality.


For the record, a quick and dirty workaround for nginx is:

  1. make sure the reverse proxy session cookie is not sent back to LookyLoo upstream;
  2. make sure that any Set-Cookie header set by LookyLoo is blocked from reaching the user browser.

There does not seem to be a way of modifying cookie headers sent to upstreams directly in the nginx config, so point 1 would either have to use Lua (as in our case) or some other method; point 2 can be done with the proxy_hide_header Set-Cookie; nginx config directive.
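The second half of that workaround can be sketched in nginx config (the Lua part for stripping the request cookie is deployment-specific and omitted here; the upstream address is an example):

```nginx
location / {
    proxy_pass http://localhost:5100/;
    # Block LookyLoo's Set-Cookie from reaching the browser, so it cannot
    # overwrite the reverse proxy's own "session" cookie.
    proxy_hide_header Set-Cookie;
}
```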

Documentation: where does LookyLoo keep the scraped data

It would be helpful to have information on where LookyLoo keeps the scraped data -- this is required, for example, to set up volume mounts in Docker so that scraped data persists across containers being recreated.

Mockups

  • Heritable display of tree node (two types: URL & type) -> need to represent inheritance from host-name node
  • Confirmation box for save

A Folding search

Hello,

It would be nice to have a "search" which will find and unfold only the relevant path to the result of the search.

Screenshots

It would be an amazing improvement if screenshots of each of the HTML pages retrieved in the process of scraping were available via the interface for inspection (this would be very informative when researching a targeted phishing attack, for instance).

MISP Integration

Lookups:

  • Domains
  • URLs & Part of URL
  • Hashes of JS/exe, ...
  • Cookies

Push:

  • Domains
  • URLs & Part of URL
  • Any content (JS/exe, ...)
  • Cookies

Duplicates

  • Same cookies set by multiple websites
  • Same JavaScript / Executable / Json / ...
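The resource-duplicate case above amounts to grouping captured bodies by a content hash and keeping the groups seen on more than one hostname. A sketch under an assumed, simplified data model (Lookyloo's real capture structure differs):

```python
import hashlib
from collections import defaultdict

# Hypothetical (hostname, response body) pairs from several captures
resources = [
    ('tracker.example', b'var t=1;'),
    ('cdn.example', b'var t=1;'),
    ('other.example', b'body{}'),
]

# Group hostnames by the SHA-256 of the body they served
by_hash = defaultdict(set)
for hostname, body in resources:
    by_hash[hashlib.sha256(body).hexdigest()].add(hostname)

# Keep only bodies served from more than one hostname
duplicates = {h: hosts for h, hosts in by_hash.items() if len(hosts) > 1}
```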

Nginx Gateway Timeout

Hello,

I am running Lookyloo in Production, and have nginx running.

Whenever I submit a URL for scanning, I get a page returned saying:

504 Gateway Time-out
nginx/1.14.0 (Ubuntu)

Here is the settings under vim /etc/nginx/sites-enabled/lookyloo

server {
    listen 80;
    server_name lookyloo;

    location / {
        proxy_pass_header Server;
        proxy_set_header Host $http_host;
        proxy_redirect off;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Scheme $scheme;
        proxy_connect_timeout 10;
        proxy_read_timeout 10;
        proxy_pass http://localhost:5100/;
    }
}

I can't find a solution to this issue, are you able to assist?
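One likely culprit (not a confirmed diagnosis) is the 10-second proxy_read_timeout in the config above: a capture can easily take longer than that, so nginx gives up on the upstream mid-capture and returns 504. A sketch of the adjusted location block:

```nginx
location / {
    proxy_pass_header Server;
    proxy_set_header Host $http_host;
    proxy_redirect off;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Scheme $scheme;
    proxy_connect_timeout 10;
    # Give the upstream enough time to finish a capture before timing out
    proxy_read_timeout 300;
    proxy_pass http://localhost:5100/;
}
```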

BS4 missing from requirements

In a pristine Debian stable Python 3 installation, Lookyloo is not able to start because the Beautiful Soup 4 Python module is missing from the requirements.

Export all domains

It would be nice to export all the domains at once to compare them between runs.
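Comparing the domain sets of two runs is plain set arithmetic once an export exists. A sketch with hypothetical domain lists (Lookyloo has no such export yet):

```python
# Domains seen in two hypothetical runs of the same capture
run_one = {'example.com', 'cdn.example.net', 'tracker.example.org'}
run_two = {'example.com', 'cdn.example.net', 'ads.example.io'}

# Domains that newly appeared or disappeared between the runs
appeared = run_two - run_one
disappeared = run_one - run_two
```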

Report lookup redirects to index despite tree_uuid created

I observed the following behavior using https://www.circl.lu/urlabuse/

  1. Go to https://www.circl.lu/urlabuse/
  2. Insert a Link and hit Run lookup
  3. Click the Link 'See on Lookyloo'
  4. You are redirected to the index

The link contains a valid tree_uuid but it seems that lookup_report_dir doesn't return a valid report_dir and thus redirects you to the index.

After some moments the report is viewable.

Expected behavior:
Show an "in progress" notice while keeping the URL intact to enable manual refresh (F5), or redirect to the finished report once it is done.

Integration of URL Abuse

The goal is to asynchronously fire requests to URL Abuse after the scraping is over and while the tree is displayed:

  • Every URL will be sent to every relevant endpoint
  • Every domain will be resolved and sent to every relevant endpoint

Errors when setting up lookyloo.service

Hello,
Is anyone able to share their copy of /etc/systemd/system/lookyloo.service ?

Here is mine:

[Unit]
Description=uWSGI instance to serve lookyloo
After=network.target

[Service]
User=root
Group=root
WorkingDirectory=/opt/lookyloo
Environment=PATH="/usr/bin/python"
ExecStart=/opt/lookyloo/bin/start.py
Environment=LOOKYLOO_HOME=/opt/lookyloo

[Install]
WantedBy=multi-user.target

And I'm getting the following error:

# sudo systemctl status lookyloo
● lookyloo.service - uWSGI instance to serve lookyloo
   Loaded: loaded (/etc/systemd/system/lookyloo.service; disabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2019-04-04 13:47:44 CEST; 2min 48s ago
  Process: 3857 ExecStart=/opt/lookyloo/bin/start.py (code=exited, status=126)
 Main PID: 3857 (code=exited, status=126)

Apr 04 13:47:44 server systemd[1]: Started uWSGI instance to serve lookyloo.
Apr 04 13:47:44 server systemd[1]: lookyloo.service: Main process exited, code=exited, status=126/n/a
Apr 04 13:47:44 server start.py[3857]: /usr/bin/env: ‘python3’: Not a directory
Apr 04 13:47:44 server systemd[1]: lookyloo.service: Failed with result 'exit-code'.
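The log points at the PATH line: Environment=PATH="/usr/bin/python" replaces the search path with a single file, so /usr/bin/env cannot traverse it to find python3 (hence "Not a directory" and exit status 126). An untested sketch of a corrected unit, keeping the paths from the unit above:

```ini
[Unit]
Description=Lookyloo
After=network.target

[Service]
User=root
Group=root
WorkingDirectory=/opt/lookyloo
# Keep a real search path instead of pointing PATH at a single binary;
# /usr/bin/env needs directories to search for python3.
Environment=PATH=/usr/local/bin:/usr/bin:/bin
Environment=LOOKYLOO_HOME=/opt/lookyloo
ExecStart=/opt/lookyloo/bin/start.py

[Install]
WantedBy=multi-user.target
```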

Add basic user agent support

A few user agents, and free text box for folks who want to shoot themselves in the foot. (with a link to info on user agents so they can avoid their feet if they like)

Search box for UUID (hostname or url node)

Each node (hostname tree and URL tree) has a UUID; adding a search box for a UUID on the main page -> load the tree and put a red box around the node.

Dependencies:

  • Dump a pickled tree to keep the UUIDs after first generation
  • For each pickle, dump the list of all UUIDs (Hostname/URL) in the directory for searching later

Requirements:

  • Force delete pickle for a tree (needs confirm box)
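The two dependencies above can be sketched with the standard library: pickle the tree once, and alongside it dump the set of node UUIDs so later searches never need to unpickle the full tree. File names and the tree structure here are assumptions for illustration.

```python
import pickle
import tempfile
import uuid
from pathlib import Path

capture_dir = Path(tempfile.mkdtemp())

# Hypothetical captured tree with one root and one child node
tree = {'root': {'uuid': str(uuid.uuid4()),
                 'children': [{'uuid': str(uuid.uuid4())}]}}

# Dump the pickled tree, plus a small side index of every node UUID
all_uuids = {tree['root']['uuid']} | {c['uuid'] for c in tree['root']['children']}
(capture_dir / 'tree.pickle').write_bytes(pickle.dumps(tree))
(capture_dir / 'uuids.pickle').write_bytes(pickle.dumps(all_uuids))

# Searching only touches the small UUID index, not the pickled tree
indexed = pickle.loads((capture_dir / 'uuids.pickle').read_bytes())
found = tree['root']['uuid'] in indexed
```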

Link overlay box to source node

When the user clicks on a hostname, or an icon, it loads an overlay box that can be moved around.

The box needs to be connected to the originating node.

docker-compose fails on initializing AsyncScraper

Hi,

today I wanted to setup a docker container and faced the following issue. All previous 16/19 steps went well. Could someone have a look and advise how to fix it? Thank you.

Step 17/19 : run nohup pipenv run async_scrape.py
---> Running in 0197ffd4a2bc
Loading .env environment variables…
09:06:05 AsyncScraper INFO:Initializing AsyncScraper
Traceback (most recent call last):
File "/root/.local/share/virtualenvs/lookyloo-lb761Agm/lib/python3.6/site-packages/redis/connection.py", line 538, in connect
sock = self._connect()
File "/root/.local/share/virtualenvs/lookyloo-lb761Agm/lib/python3.6/site-packages/redis/connection.py", line 861, in _connect
sock.connect(self.path)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/root/.local/share/virtualenvs/lookyloo-lb761Agm/bin/async_scrape.py", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
File "/root_lookyloo/lookyloo/bin/async_scrape.py", line 36, in <module>
m = AsyncScraper()
File "/root_lookyloo/lookyloo/bin/async_scrape.py", line 24, in __init__
self.lookyloo = Lookyloo(loglevel=loglevel, only_global_lookups=only_global_lookups)
File "/root_lookyloo/lookyloo/lookyloo/lookyloo.py", line 45, in __init__
if not self.redis.exists('cache_loaded'):
File "/root/.local/share/virtualenvs/lookyloo-lb761Agm/lib/python3.6/site-packages/redis/client.py", line 1307, in exists
return self.execute_command('EXISTS', *names)
File "/root/.local/share/virtualenvs/lookyloo-lb761Agm/lib/python3.6/site-packages/redis/client.py", line 836, in execute_command
conn = self.connection or pool.get_connection(command_name, **options)
File "/root/.local/share/virtualenvs/lookyloo-lb761Agm/lib/python3.6/site-packages/redis/connection.py", line 1071, in get_connection
connection.connect()
File "/root/.local/share/virtualenvs/lookyloo-lb761Agm/lib/python3.6/site-packages/redis/connection.py", line 543, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 2 connecting to unix socket: /root_lookyloo/lookyloo/cache/cache.sock. No such file or directory.
ERROR: Service 'lookyloo' failed to build: The command '/bin/sh -c nohup pipenv run async_scrape.py' returned a non-zero code: 1

show redirects vertically rather than horizontally?

Because they don't return resources to the browser, I think redirects are qualitatively different from other reference types like script and CSS sources and iframes, but they currently manifest in the same way, as depth in the tree. Since redirects typically happen before resources are loaded, there would generally be lots of extra vertical space available in the earlier parts of the tree, so perhaps they could be oriented vertically to emphasize this difference? For example cnn.com (https://lookyloo.circl.lu/tree/5ea5cebb-9223-42db-bdeb-34543b237b05) shows

cnn.com --> www.cnn.com --> www.cnn.com --> edition.cnn.com --> ... resources ...

would it be possible to get them to render more like this

cnn.com
   V
www.cnn.com
   V
www.cnn.com
   V
edition.cnn.com --> ... resources ...

Anonymous submit.

It would be nice to have a "don't remember me" button which allows the scanned website to not be published. ( PORN^WGDPR need )

Missing icons

File types:

  • Text
  • Audio
  • Empty content
  • POSTed in request
  • CSS
  • JSON
  • HTML
  • EXE
  • Image
  • Font
  • octet-stream
  • Video
  • Livestream
  • Link comes from an Iframe
  • No Mimetype (empty string)
  • No known type (no corresponding icon)
  • Suspected phishing (#190) -> fish + question mark?

Buttons:

  • Download URL content
  • Display URLs related to the domain

Scraping improvements

  • Proxy support
  • Pass a pre-generated cookie
  • Initial referrer
  • Locale of the browser
  • Login creds <= how to pass them properly in the webpage will be challenging (solved by passing a valid cookie)

Add collections

The possibility to "group" scan results.

Perhaps via tags or similar.

e.g: cdn.foo.example could be a group of all the sites using that cdn.

But perhaps thinking about "real" correlations would be more efficient.
