freedomofpress / fingerprint-securedrop

A machine learning data analysis pipeline for analyzing website fingerprinting attacks and defenses.

License: GNU Affero General Public License v3.0

Languages: Python 57.83%, Shell 0.54%, Go 5.65%, Jupyter Notebook 35.98%
Topics: securedrop, machine-learning, tor, onion-service, hidden-service, website-fingerprinting, traffic-analysis

fingerprint-securedrop's Introduction

fingerprint-securedrop


This repository is a work in progress to implement an end-to-end data collection and analysis pipeline to help tackle the problem of website fingerprinting attacks in the context of Tor Hidden Services [1]. It is designed as a single system that carries out everything from data collection to feature generation to model training and analysis, with the intention of helping us evaluate and develop defenses to be implemented in the SecureDrop whistleblower submission system.

If you are a researcher interested in this problem, we encourage you to collaborate with us in our Gitter chatroom and via our mailing list. Feel free to get in touch personally as well.

The pipeline works as follows:

  • sorter.py scrapes Hidden Service directories, and visits every .onion URL it finds. It groups sites into two classes: SecureDrop and non-monitored.
  • crawler.py fetches sites from these classes and records the raw Tor cells.
  • features.py generates features based on these raw Tor cells.
  • The model training, classification, and presentation of results (graph generation) code is still in development.

Our hope is that later we will be able to make this code more composable. There has already been some effort in that direction, and it should be pretty easy to use at least the sorter and crawler if you're interested in monitoring a site besides SecureDrop.

Getting Started

Dependencies

  • Ansible >= 2.0
  • Vagrant
  • VirtualBox

Provisioning a local VM

cd fingerprint-securedrop
vagrant up
vagrant ssh
cd /opt/fingerprint-securedrop/fpsd

Running the Sorter

./sorter.py

To look at the sorter log while it's running, run less +F logging/sorter-latest.log. If you're not using the database, data files will be timestamped, with logging/class-data-latest.pickle symlinked to the latest data. Otherwise, run psql and poke around the hs_history table.
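For reference, here is a minimal sketch of inspecting those results from Python with psycopg2 instead of an interactive psql session; it assumes the PG* environment variables (or ~/.pgpass) set up during provisioning, and makes no assumptions about column names:

#!/usr/bin/env python3
# Sketch only: print the first few rows the sorter has stored in
# raw.hs_history. Credentials come from the PG* environment variables
# or ~/.pgpass provisioned by Ansible; no column names are assumed.
import psycopg2

with psycopg2.connect("") as conn:  # empty DSN: use libpq env defaults
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM raw.hs_history LIMIT 10")
        for row in cur.fetchall():
            print(row)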

Running the Crawler

./crawler.py

To look at the crawler log while it's running, run less +F logging/crawler-latest.log, and to look at the raw Tor cell log, run less +F /var/log/tor_cell_seq.log. You can also check out the traces it's collecting as it runs (cd logging/batch-latest), or look at the frontpage traces and other related tables (see the Database Design section).

A systemd unit is also provided to run the crawler on repeat: simply run sudo systemctl start crawler.

Using PostgreSQL for data storage and queries

The data collection programs—the sorter and crawler—are integrated with a PostgreSQL database. When the use_database option is set to True in the [sorter] section of fpsd/config.ini, the sorter will save its sorted onion addresses in the database. When the use_database option is set to True in the [crawler] section of fpsd/config.ini, the crawler will grab onions from the database, connect to them, record traces, and store them back in the database. You can also use a remote database by configuring the [database] section of fpsd/config.ini.
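For illustration, a minimal sketch of how a component might consume these settings with configparser; only the section names and the use_database option come from this README, so treat everything else here as an assumption:

from configparser import ConfigParser

config = ConfigParser()
config.read("fpsd/config.ini")

# Decide whether the crawler should talk to PostgreSQL or fall back to
# pickle files on disk. This mirrors the behavior described above; the
# exact implementation in crawler.py may differ.
if config.getboolean("crawler", "use_database"):
    print("crawler will read onions from and write traces to the database")
else:
    print("crawler will read/write local pickle files instead")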

By default, a strong database password will be generated for you automatically, written to /tmp/passwordfile on the Ansible controller, and saved to a PGPASSFILE, ~{{ ansible_user }}/.pgpass, on the remote host. (If you want to set your own password, I recommend setting the PGPASSWORD Ansible var before provisioning--as a precaution, re-provisioning will never overwrite a PGPASSFILE, but you can also overwrite it yourself if you wish to re-configure your database settings.) Environment variables are also set such that you should be able to simply issue the command psql to authenticate to the database and begin an interactive session.

Database Design

We store the raw data in the raw schema and the derived features in the features schema. The sorter writes to raw.hs_history, inserting one row per sorted onion address. The crawler reads from raw.hs_history and writes one row per crawl session to raw.crawls, one row per trace to raw.frontpage_examples, and one row per cell in the trace to raw.frontpage_traces.
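As a sketch of how these tables can be queried together, something along the following lines should pull all cells for one trace; exampleid is the join key described in the database integration issue below, while every other detail (credentials via PG* environment variables, the lack of explicit column names) is an assumption:

import psycopg2

# Hypothetical helper: fetch the raw cells recorded for a single trace.
CELLS_FOR_EXAMPLE = """
    SELECT t.*
    FROM raw.frontpage_traces AS t
    JOIN raw.frontpage_examples AS e USING (exampleid)
    WHERE e.exampleid = %s
"""

def fetch_trace(exampleid):
    with psycopg2.connect("") as conn, conn.cursor() as cur:
        cur.execute(CELLS_FOR_EXAMPLE, (exampleid,))
        return cur.fetchall()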

The current design of the database is shown in the following figure:

fingerprint-securedrop's People

Contributors

conorsch, garrettr, redshiftzero


fingerprint-securedrop's Issues

Some pages load for more than 20s

Separate from issue #21, I've noticed that sometimes pages will load for longer than the 20s timeout we set, but do eventually time out. It seems that sometimes the timeout will not occur until exactly 2 minutes into the page load attempt. For example, I just watched this happen with http://vtduisq4g6heuzom.onion. Other times it is a more irregular timeout such as 1m19s, or some other value usually in the (20s, 120s) interval.

Look into stem's irregular logging behaviour

I noticed while writing test_database_methods.py that stem logs related to Crawler activity appear in the Sorter's log. IIRC, stem tries to find existing loggers in the thread and uses them. This approach seems problematic when multiple instances are running in the same thread. Related: #34.

Look into unclosed transport warning for asyncio

There might be a bug upstream. Low-priority because this shouldn't actually create any problems:

test_sortcrawl_sd_dir (test.test_database_methods.TestDatabaseMethods) ... /usr/lib/python3.5/asyncio/sslproto.py:328: ResourceWarning: unclosed transport <asyncio.sslproto._SSLProtocolTransport object at 0x7f3f81e32630>
  warnings.warn("unclosed transport %r" % self, ResourceWarning)

Open world cross-validation

We should implement open-world cross-validation - and the ability to define how much of the world the adversary can see. Both these config options (world: type: 'open' and world: observed_fraction) and corresponding fields to store them in the database (in models.undefended_frontpage_attacks) were created in PR #57. It would be nice if we could use existing code from one of the splitting methods here but writing custom CV may be necessary. If custom CV code is indeed necessary, it should definitely be unit tested.

Also: consider making the observed_fraction in the config take a list of possible values.
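If custom CV does turn out to be necessary, a rough sketch of the splitting logic might look like the following (not the project's implementation; the boolean labels, fraction handling, and use of StratifiedKFold are all assumptions):

import numpy as np
from sklearn.model_selection import StratifiedKFold

def open_world_splits(y, observed_fraction, n_splits=10, seed=0):
    """Yield (train_idx, test_idx) pairs for an open-world evaluation.

    y is a boolean array: True for monitored (SecureDrop) examples,
    False for the non-monitored background world. The adversary trains
    on all monitored examples but only sees `observed_fraction` of the
    non-monitored world; testing uses the full held-out fold.
    """
    y = np.asarray(y, dtype=bool)
    rng = np.random.RandomState(seed)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(np.zeros(len(y)), y):
        nonmon = train_idx[~y[train_idx]]
        # Drop the unobserved share of the background world from training.
        drop = rng.choice(nonmon,
                          size=int(len(nonmon) * (1 - observed_fraction)),
                          replace=False)
        yield np.setdiff1d(train_idx, drop), test_idx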

Run multiple sorters/crawlers in the same or different Python processes worry free

The sorter and crawler were written with one use case in mind: run one per Python process and only run one Python process at a time. That sucks, as I realized while writing a test case for the database suite. Some things that wouldn't suck:

  • Ability to have Tor bind to a specific port (default 9050) if it's free, or otherwise find an open port to use.
    • The same goes for Tor data directories--we could perhaps create ephemeral ones if the default, /var/lib/tor, is in use.
  • Ability to connect to an already running Tor process.
  • The ability to start a tor process via the __init__ of one Sorter or Crawler object, and then pass a particular attribute of that object to the constructor of another Sorter or Crawler object, to let it share the same Tor process.
  • Explicitly kill the Tor process both in the __exit__ of both the Sorter and Crawler classes, and in a special close method for when with context management is not desired.

But there's still a big problem to be worked out:

  • If a Crawler object shares a Tor process with another Sorter/Crawler object, its traces will be muddied by the other object.

A potential way to address this would be to extend the relay.c.patch to create a --CellLog <file> option to allow specification of where the cell logs should go (instead of the hardcoded $HOME/FingerprintSecuredrop/fpsd/logging/tor_cell_seq.log file).

This last problem is something that we can warn about for the time being since the fix would be rather time-consuming considering my modest C background. The first block of functionality should be implemented soon, however.

Create a test suite including a test to simulate a remote server disconnecting

Since this is a research tool and not production software, I don't think it's necessary to go overboard here or get complete coverage. That said, an issue has popped up (namely, #4) that I've thus far been unable to debug, and having a test suite that does a couple of basic tests, plus a more advanced test that props up a basic onion service website which (i) allows an initial connection, (ii) receives the GET request from our client, and (iii) drops the connection without a response, would help us debug #4.

Look into unclosed file warning for tbselenium

There might be a bug upstream. Low-priority because this shouldn't actually create any problems:

test_crawl_of_bad_sites (test.test_sketchy_sites.CrawlBadSitesTest) ... /usr/local/lib/python3.5/dist-packages/tbselenium/tbdriver.py:289: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/vagrant/tbb/tor-browser_en-US/Browser/TorBrowser/Docs/sources/versions' mode='r' encoding='ISO-8859-1'>
  for line in open(version_file):

Config tasks should ensure crawler service is running

Since merging #54, we now have a systemd service file that manages running the crawler over time, including automatic restarting. The Ansible logic does not ensure that the crawler service is started, however—developers must still start crawls manually.

Let's add a service task to the crawler provisioning logic to ensure the crawler gets started and enabled (meaning it will start after a reboot as well).

Crawler stalling indefinitely--cause unknown

http://xnsoeplvch4fhk3s.onion/ stalls the crawler indefinitely. The 20s page load timeout variable should kill the connection, but for some reason Selenium fails to do so with this site.

Here's the Firefox log:

[07-18 18:00:04] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/ via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/style.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/effects.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/prettyPhoto.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/css_002.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/jss-style.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/attentionGrabber_css.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/wp-customer-reviews.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/woocommerce.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/css.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/css3_grid_style_002.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/css3_grid_style.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/styles.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/jquery_002.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/agent.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/default.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/rounded.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/custom_002.htm via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/converter.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/social-product-automation.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/faq.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/ga_002.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/ga.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/jquery-2.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/jss-script.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/attentionGrabber_js.htm via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/sws_frontend.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/wp-customer-reviews.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:09] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/comment-reply.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/iphorm.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/swfupload_002.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/swfobject.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/swfupload_003.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/swfupload.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/jquery-migrate.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/social-product-automation.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/superfish.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/general.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/jquery.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/slides.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/affiliate_platform_style.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/jquery.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/black.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/shortcodes.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/custom.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/select-package.png via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/featured-tag.png via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/starttag.png via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/tick_04.png via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/TwitterFollowers-Payments-Badges-New1a.jpg via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/TwitterFollowers-Payments-Badges-New1b.jpg via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/logos2.png via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/Twitter001.png via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/1369009171_twitter_bird_blueprint-social.png via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/1364267098_anonymous.png via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/guarantee4.jpg via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/jquery-ui-1.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/jquery_008.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/jquery-ui-1.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/des_expander.css via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/money.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/cookie.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/folding.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/jquery_007.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/jquery_004.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/jquery_002.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/jquery_006.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/jquery_005.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/jquery_003.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/rounded.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/jquery-plugins.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/woocommerce.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/des_expander.js via xnsoeplvch4fhk3s.onion:0
[07-18 18:00:10] Torbutton INFO: tor SOCKS: http://xnsoeplvch4fhk3s.onion/amazongc_files/css/reset.css via xnsoeplvch4fhk3s.onion:0

Here's the traceback after I killed the crawler with ^C:

noah@hs-crawler-nyc:~/FingerprintSecureDrop/fpsd$ ./crawler.py
^C[tbselenium] Request-sent
Traceback (most recent call last):
  File "./crawler.py", line 212, in collect_onion_trace
    self.crawl_url(url)
  File "./crawler.py", line 270, in crawl_url
    wait_for_page_body=True)
  File "/home/noah/FingerprintSecureDrop/fpsd/tor-browser-selenium/tbselenium/tbdriver.py", line 156, in load_url
    self.find_element_by("body", find_by=By.TAG_NAME)
  File "/home/noah/FingerprintSecureDrop/fpsd/tor-browser-selenium/tbselenium/tbdriver.py", line 163, in find_element_by
    EC.presence_of_element_located((find_by, selector)))
  File "/usr/local/lib/python3.5/dist-packages/selenium/webdriver/support/wait.py", line 71, in until
    value = method(self._driver)
  File "/usr/local/lib/python3.5/dist-packages/selenium/webdriver/support/expected_conditions.py", line 59, in __call__
    return _find_element(driver, self.locator)
  File "/usr/local/lib/python3.5/dist-packages/selenium/webdriver/support/expected_conditions.py", line 274, in _find_element
    return driver.find_element(*by)
  File "/usr/local/lib/python3.5/dist-packages/selenium/webdriver/remote/webdriver.py", line 744, in find_element
    {'using': by, 'value': value})['value']
  File "/usr/local/lib/python3.5/dist-packages/selenium/webdriver/remote/webdriver.py", line 231, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/usr/local/lib/python3.5/dist-packages/selenium/webdriver/remote/remote_connection.py", line 395, in execute
    return self._request(command_info[0], url, body=data)
  File "/usr/local/lib/python3.5/dist-packages/selenium/webdriver/remote/remote_connection.py", line 426, in _request
    resp = self._conn.getresponse()
  File "/usr/lib/python3.5/http/client.py", line 1197, in getresponse
    response.begin()
  File "/usr/lib/python3.5/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.5/http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.5/socket.py", line 575, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./crawler.py", line 466, in <module>
    ratio=int(config["monitored_nonmonitored_ratio"]))
  File "./crawler.py", line 437, in crawl_monitored_nonmonitored_classes
    trace_dir=nonmon_trace_dir)
  File "./crawler.py", line 398, in collect_set_of_traces
    retry=False)
  File "./crawler.py", line 387, in collect_set_of_traces
    iteration=iteration) == "failed"
  File "./crawler.py", line 225, in collect_onion_trace
    self.controller.get_circuits()
  File "/usr/local/lib/python3.5/dist-packages/stem/control.py", line 414, in wrapped
    raise exc
  File "/usr/local/lib/python3.5/dist-packages/stem/control.py", line 409, in wrapped
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/stem/control.py", line 3035, in get_circuits
    response = self.get_info('circuit-status')
  File "/usr/local/lib/python3.5/dist-packages/stem/control.py", line 414, in [07-18 17:58:24] Torbutton INFO: tor SOCKS: https://fonts.gstatic.com/s/permanentmarker/v5/9vYsg5VgPHKK8SXYbf3sMsW72xVeg1938eUHStY_AJ4.woff2 via cmyaw5mzy7dse3xl
wrapped
    raise exc
  File "/usr/local/lib/python3.5/dist-packages/stem/control.py", line 409, in wrapped
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/stem/control.py", line 1113, in get_info
    raise exc
  File "/usr/local/lib/python3.5/dist-packages/stem/control.py", line 1065, in get_info
    response = self.msg('GETINFO %s' % ' '.join(params))
  File "/usr/local/lib/python3.5/dist-packages/stem/control.py", line 580, in msg
    raise exc
  File "/usr/local/lib/python3.5/dist-packages/stem/control.py", line 563, in msg
    raise response
  File "/usr/local/lib/python3.5/dist-packages/stem/control.py", line 853, in _reader_loop
    control_message = self._socket.recv()
  File "/usr/local/lib/python3.5/dist-packages/stem/socket.py", line 177, in recv
    raise exc
  File "/usr/local/lib/python3.5/dist-packages/stem/socket.py", line 156, in recv
    return recv_message(socket_file)
  File "/usr/local/lib/python3.5/dist-packages/stem/socket.py", line 561, in recv_message
    raise stem.SocketClosed('Received empty socket content.')
stem.SocketClosed: Received empty socket content.

I also tried visiting it on my desktop and no page content would load. From the console:

getFirstPartyURI failed for chrome://browser/content/browser.xul: 0x80070057
[07-18 21:26:11] Torbutton WARN: no SOCKS credentials found for current document.
getFirstPartyURI failed for view-source:http://xnsoeplvch4fhk3s.onion/: no host in first party URI view-source:http://xnsoeplvch4fhk3s.onion/
[07-18 21:26:13] Torbutton WARN: no SOCKS credentials found for current document.

Conform ansible_ssh_user reference usage to 2.0 spec

Ansible 2.0 has deprecated the “ssh” from ansible_ssh_user, ansible_ssh_host, and ansible_ssh_port to become ansible_user, ansible_host, and ansible_port. If you are using a version of Ansible prior to 2.0, you should continue using the older style variables (ansible_ssh_*). These shorter variables are ignored, without warning, in older versions of Ansible.

Crawler cannot read from file and write to database (or vice versa)

Currently, if one sets use_database=True in fpsd/config.ini under the [crawler] section, then the Crawler will attempt to both read from and write to the fpsd (by default) database. This is a problem when one wants to test just the Crawler locally, as we do not initialize the hs_history table with any data. It would be nice to be able to read from a local pickle file and write to the database, just for testing purposes in a local VM. The Sorter should probably always write a pickle file (and we should at least try to use that to keep the one in this repo from getting too stale). This is not super important.

For future reference, @conorsch, I would just run python3 -m pytest test/test_sketchy_sites.py to test the crawler. This would be good for testing the systemd service especially, because it should complete within a couple of minutes and you can test that it's being restarted.

Have the sorter save results directly to a database

While the singular class_data pickle files generated by the sorter are much less to deal with, and we'll be doing much less analysis of them than of the traces from the crawler, the small amount of time it would take to SQLize the sorter would make looking at trends in HS site uptime and reachability easier, and that might be useful for analyzing defense strategies later on in the research process. This is one of several separate issues #16 is being split up into.

write burst generation code in sql

Rewrite the burst generation code currently performed in FeatureStorage.create_bursts() in features.py in SQL, so that feature generation is fast and scalable.
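A rough gaps-and-islands sketch of what the SQL could look like, where a burst is a maximal run of consecutive cells in the same direction; exampleid comes from the schema described above, while cell_time and ingoing are placeholder column names rather than the real ones:

# Sketch only: compute per-direction burst lengths with window functions.
BURSTS_SQL = """
WITH ordered AS (
    SELECT exampleid, ingoing,
           ROW_NUMBER() OVER (PARTITION BY exampleid ORDER BY cell_time)
         - ROW_NUMBER() OVER (PARTITION BY exampleid, ingoing ORDER BY cell_time)
           AS grp
    FROM raw.frontpage_traces
)
SELECT exampleid, ingoing, COUNT(*) AS burst_length
FROM ordered
GROUP BY exampleid, ingoing, grp
"""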

Better multi-user system support

Multi-user machines are currently not well supported. To clarify, deploying to multi-user machines works just fine, but one has to re-run the playbook for each user, which is time- and resource-intensive compared to a for-all-users-simultaneously deployment solution. Such a solution would require the following changes (and perhaps more):

  • Download Tor and Tor Browser to /usr/local/src/tbb instead of ~/tbb.
  • Set environment vars for database access in /etc/bash.bashrc instead of ~/.bashrc.
  • Clone the repo itself to /opt/FingerprintSecureDrop instead of ~/FingerprintSecureDrop.

The defaults in the fpsd/{sorter.py,crawler.py,config.ini} should be updated to reflect these changes. Thoughts @conorsch?

Ditch virtualenvs

Virtualenvs were originally brought in because tbselenium was not Python 3 compatible. We could probably ditch them now and just run sudo pip install (since we're in a VM anyway, this shouldn't matter). This would simplify provisioning.

Have the crawler save data directly to a database

Right now all data is saved to log files, which makes management messy, especially across servers. Having all data in a SQL database will make analysis much easier, and it's best if our crawler just saves what it collects directly to such a database. This is one of several separate issues #16 is being split up into.

Consider using aiosocks library instead of chaining proxies for the sorter

asyncio and aiohttp being very new libraries, nothing like https://github.com/nibrag/aiosocks existed when I first wrote the sorter, so the sorter connects to Privoxy running as an HTTP proxy on localhost, and Privoxy in turn connects to the tor SOCKS port and passes the traffic on. It would be preferable not to use Privoxy, so it would be good to try aiosocks out and see if I can get it working with the sorter.

Feature Selection

Many of our features are not very useful. We should include a feature selection step before passing the feature matrix to the classifier. This could be something simple, e.g. a variance threshold, or something more complex. See the scikit-learn feature selection reference for how we can do this (no wheel invention necessary).
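As a minimal sketch of the simple option, assuming a scikit-learn style feature matrix (the threshold value and the classifier here are placeholders, not choices the project has made):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline

# Drop near-constant features before they ever reach the classifier.
attack = Pipeline([
    ("select", VarianceThreshold(threshold=0.01)),
    ("classify", RandomForestClassifier(n_estimators=100)),
])
# attack.fit(X_train, y_train); attack.predict_proba(X_test)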

Model evaluation over a range of base rates

We don't really know the base rate of SecureDrop usage: yay anonymity. However, this means that in order to see how many users an attacker would correctly and incorrectly flag, we should evaluate each model's precision and recall over a range of base rates - and of course save the results in the database. Note that in PR #57 I created the field testing_class_balance to keep track of this in table models.undefended_frontpage_attacks.
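The evaluation itself reduces to re-weighting each model's TPR/FPR: at an assumed base rate pi, precision = TPR*pi / (TPR*pi + FPR*(1-pi)), while recall is just the TPR and does not depend on pi. A small illustrative sketch (the example rates are made up):

import numpy as np

def precision_at_base_rates(tpr, fpr, base_rates):
    """Precision of a detector with the given TPR/FPR at each base rate."""
    base_rates = np.asarray(base_rates, dtype=float)
    return tpr * base_rates / (tpr * base_rates + fpr * (1 - base_rates))

# e.g. a model with 95% TPR and 1% FPR, over base rates from 1e-6 to 1e-1
print(precision_at_base_rates(0.95, 0.01, np.logspace(-6, -1, 6)))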

Implement a robust test suite and a .travis.yml

We should implement a robust test suite that covers all operations of the crawler and sorter. Then we can also take advantage of Travis for automatic integration tests instead of doing all testing manually. This issue is a WIP; its contents are subject to change as we decide which tests to implement and how. Feel free to edit @redshiftzero

Deployment

  • Prevent regressions that may come with new tor versions by confirming that tor builds correctly after being patched and that the output to our log file looks as expected.

Sorter

Crawler

  • Show how to take a screenshot with the post-GET function (extra_fn) you can specify, and test that this functionality works as expected. Maybe confirm the screenshot PNG hash of a very static page is as expected.
  • Manually raise some troublesome exceptions in the extra_fn and test that they're handled correctly, produce tracebacks in the logs, don't crash the crawler, etc.
  • Show the restart method works correctly (also a good time to test it for recovery from _sketchy_exceptions).
  • Be able to work through a list of sites known to have previously crashed the crawler and show that the crawler now handles them.

Database

  • Sort the SecureDrop directory into up-to-date (monitored) and out-of-date (non-monitored) classes, then query the database for the onions that were sorted since the start of the test. Verify that the objects returned are the correct type and have sensible contents. Also verify that the SQL entries look like they should.
    • Following up on this, retrieve this data for a crawler run. Make sure the crawler can process the class data. After the crawler has uploaded its own results to the database without exception, do some verification that the corresponding entries look sane.
      • Eventually follow up with a third step that makes sure the (not yet complete) feature generation/ feature schema related code integrates with the above.

Add tags to the playbook

Currently we have separated out the crawler and sorter roles in Ansible, which is useful for speeding up provisioning, especially because one may do a whole lot more sorting than crawling. Instead of having to go into the playbook and un/comment lines to un/ignore particular roles, it would be easier to just be able to pass the --tags or --skip-tags options.

Crawler is running into terminal connection refused socket failures

Edit: see #4 (comment) for a better explanation and traceback. Don't know why this original report was so half-assed and lacked even the full traceback.

So the crawler is for the most part working very well. Where it runs into problems is what seems to be a Python IO/socket exception (Errno 111). Once it hits this error, it will fail the rest of the way through the crawl pretty instantaneously. See the log at the bottom of this post.

I believe that this is actually caused by a bug in Python 3.5--see https://bugs.python.org/issue26402, but this warrants further testing. The PPA we've been using at https://launchpad.net/~fkrull/+archive/ubuntu/deadsnakes?field.series_filter=trusty has not seen an updated version of Python 3.5 since December for Ubuntu 14.04 (trusty). This is about our only choice for newer Python versions, and I've already done the work to migrate this script to Python 3.5, so we could use a single virtual environment for both the HS sorting and crawling scripts. Since at this point in our research we don't really need to run the sorting script, I think I'll just break compatibility with it by making the necessary changes in the ansible roles to install and use Python 3.3, and that should hopefully fix things.

♫ Truckin' ♫
...
06:51:26 http://maghreb2z2zua2up.onion: exception: Remote end closed connection without response
06:51:26 http://radiohoodxwsn4es.onion: loading...
06:51:26 http://radiohoodxwsn4es.onion: exception: [Errno 111] Connection refused
06:51:26 http://tqjftqibbwtm4wmg.onion: loading...
06:51:26 http://tqjftqibbwtm4wmg.onion: exception: [Errno 111] Connection refused
06:51:26 http://newstarhrtqt6ua7.onion: loading...
06:51:26 http://newstarhrtqt6ua7.onion: exception: [Errno 111] Connection refused
...
And so on (it fails through the rest of the URLs pretty much instantly).


Save the values of the averaged (over k-folds) ROC curve in the database

Right now we generate a ROC curve that is averaged over all k folds, but we don't save the FPR and TPR in the database in models.undefended_frontpage_attacks. We should do that so that we can quickly determine what the FPR and TPR are at a given threshold. Note that these are currently generated and saved for each individual fold.
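One way to get storable averaged values is to interpolate each fold's curve onto a common FPR grid before averaging; a sketch (not the project's existing averaging code) follows:

import numpy as np

def mean_roc(per_fold_curves, grid_points=101):
    """Average per-fold ROC curves onto a common FPR grid.

    per_fold_curves: list of (fpr, tpr) array pairs, one per fold.
    Returns (mean_fpr, mean_tpr), ready to be inserted into the database.
    """
    mean_fpr = np.linspace(0.0, 1.0, grid_points)
    interpolated = [np.interp(mean_fpr, fpr, tpr) for fpr, tpr in per_fold_curves]
    mean_tpr = np.mean(interpolated, axis=0)
    mean_tpr[0] = 0.0  # force the averaged curve through the origin
    return mean_fpr, mean_tpr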

Don't re-build/install Tor or Tor Browser unnecessarily

With some good use of when: statements in the playbook, we could avoid re-downloading, building, and installing Tor and Tor Browser. This would save us a lot of time when (re-)provisioning.

An original worry with TB was that our profile dir might be polluted as a result of normal operation of our crawler, but this should not actually happen as we're not installing or uninstalling extensions, making bookmarks, or otherwise performing actions that should result in profile dir pollution. It also seems unlikely we will be doing any such activities in the future.

As far as Tor goes, most of the tasks involving getting it set up on the system use command: which is not idempotent unless you make it so (using when: or other methods).

Improve source code documentation with gnupg-python style docstrings and docstring tests

Of Python libraries that come to mind, python-gnupg has the nicest docstrings in my opinion. Poke around https://github.com/isislovecruft/python-gnupg/blob/master/gnupg/. You learn:

  • All parameters you can pass.
  • The type of each of them.
  • A concise description of what each of them does.
  • What exception(s) may be raised.
  • What object(s) may be returned.

Further, the inclusion of examples is not only educative, but can become part of our tests with doctest. Things are harder to break, too, when you're reminded in the source code of exactly how they're supposed to work and what types they're supposed to take and return.
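As a hypothetical illustration of the style being proposed (the function below is invented for the example and is not part of the codebase):

def cell_directions(cells):
    """Map raw cell records to a sequence of directions.

    :param list cells: raw cell records as (timestamp, is_outgoing) pairs.
    :rtype: list
    :returns: +1 for each outgoing cell and -1 for each incoming cell.
    :raises ValueError: if ``cells`` is empty.

    >>> cell_directions([(0.0, True), (0.1, False)])
    [1, -1]
    """
    if not cells:
        raise ValueError("no cells provided")
    return [1 if outgoing else -1 for _, outgoing in cells]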

In the process of writing this for the Crawler and Sorter, we would also properly work out keyword unpacking such that the user may pass the full range of initialization options (or other arguments when calling methods) supported by each class they wrap (giving greater control of the Tor process and controller, the Firefox process and driver, Xvfb, etc.). The docstring for the initialization of these internal objects in particular should be divided into sections, and might even have to omit certain options if including them all turns out to be excessive. We'll also have to make sure there are no parameter name conflicts.

I think overall this can make more accessible a lot of functionality that already exists in the Crawler, can help catch mistakes before they happen, and should overall improve code quality.

Related: #28.

Tables + schemas modifications (w/in constraints) should be deployable w/ Ansible, and preserve data

Say we want to add another column to the crawls table--some new metadata item. Under the existing architecture, I'm not quite sure what would happen if we simply change the create_table_crawls.sql file and re-provision. Ideally, we'd be able to make changes that don't break database logic (e.g., removing the fk column from frontpage_examples, which rows in hs_history use to tie themselves to particular crawler runs) w/o the need either for manual intervention or the fear of data loss.

It needs to be shipped

@conorsch Hoping you can help me out. Code is ready, it just needs to be shipped to multiple DO servers 'round the 🌎. If I could provision all of them in one go that'd be awesome!

  • SF 🌉
  • New York 🚇
  • London 🍵

sudo unable to resolve host under Xenial in VirtualBox

Perhaps due to changes in 0fe85ee, when running the crawler with Vagrant a warning message about being unable to resolve the host will be printed whenever one runs sudo.

This is resolvable by adding the line 127.0.1.1 ubuntu-xenial to /etc/hosts. It's unclear whether we should worry about this, since sudo usage is not even necessary for normal crawler usage, and this issue may be resolved when the lack of the vboxfs extension in the Xenial image is corrected by the Ubuntu team upstream.

Try each onion site more than once in the sorter

There are ~7k+ onion sites in our lists, but we can reliably reach only around ~2k with the sorter. We can probably increase this number by retrying onion sites that don't return 200 OK on the first GET request.

Sorter doesn't respect order of class tests

This is because a collections.OrderedDict object, when initialized with multiple (key, value) pairs at once, will not respect the ordering you pass those entries in. Instead, they need to be added one at a time from first to last.
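A minimal illustration, assuming the class tests were passed as keyword arguments (on the Python 3.5 used here, keyword arguments are collected into a plain dict before OrderedDict ever sees them, so their order is lost); the test names below are made up:

from collections import OrderedDict

# Order NOT guaranteed on Python < 3.7:
tests = OrderedDict(securedrop_check=1, reachability_check=2)

# Fix: add the class tests one at a time, first to last.
tests = OrderedDict()
tests["securedrop_check"] = 1
tests["reachability_check"] = 2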

Generate plot of precision/recall as a function of k

We want to know in a realistic scenario - i.e. one that incorporates the effect of the class imbalance - how effective these attacks are in terms of true and false positives. A really nice plot that would show this (right now the machine learning pipeline generates only an ROC curve) is a graph of precision and recall as a function of k, the percent of the ranked list flagged. Let's add this to evaluate.py.

Also: see Figure 5 in this paper to see a nice comparison between ROC curves and precision/recall graphs in the presence of different base rates.
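A sketch of the numbers behind such a plot, assuming we have per-example classifier scores and true labels (the function and variable names here are illustrative, not from evaluate.py):

import numpy as np

def precision_recall_at_k(scores, y_true, k_percents):
    """Precision/recall when only the top k percent of the ranking is flagged."""
    order = np.argsort(scores)[::-1]          # highest-scoring examples first
    y_sorted = np.asarray(y_true)[order]
    results = []
    for k in k_percents:
        n_flagged = max(1, int(len(y_sorted) * k / 100))
        flagged = y_sorted[:n_flagged]
        precision = flagged.sum() / n_flagged
        recall = flagged.sum() / y_sorted.sum()
        results.append((k, precision, recall))
    return results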

Crawl in parallel

The crawler currently fetches each site one at a time. That was easiest to implement and ensures clean traces. With the introspection into circuits that stem gives, we should be able to identify (circuit, site-instance) tuples. This info given to a modified record_cell_seq() method could allow us to separate the cells from the different circuits and still get clean traces. Of course, we'd need to save some file pointer state, creating instead a (circuit, site-instance, start_file_ptr) tuple, since we need to keep track of what time span we should be looking for cells from each instance in our tor_cell_log.

It's unclear how much work this would require, and whether it could muddy our results by creating an unrealistic amount of Tor traffic that slows down the loading of each instance.
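For the circuit introspection piece, a rough sketch of how stem could be used to learn which circuit each site's streams ride on; the control port (9051) and the bookkeeping structure are assumptions, and this is not the project's record_cell_seq() implementation:

from stem.control import Controller, EventType

circuit_for_site = {}

def on_stream(event):
    # STREAM events carry both the destination and the circuit the stream
    # was attached to, which is the (circuit, site) mapping we'd need in
    # order to split the cell log per circuit.
    if event.circ_id:
        circuit_for_site.setdefault(event.target_address, event.circ_id)

with Controller.from_port(port=9051) as controller:
    controller.authenticate()
    controller.add_event_listener(on_stream, EventType.STREAM)
    # ... drive parallel page loads here, then consult circuit_for_site
    # when slicing tor_cell_seq.log into per-circuit traces.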

Refactor crawler and sorter to be more composable

In trying to debug #4, I've realized just how uncomposable the Crawler class is. In order to enable other people to reuse this code (let alone write tests for the crawler and sorter themselves), some reorganization could be really useful. One thing that immediately stood out is that the context management for TB and Xvfb is done within the crawl_classes() method. These two things should probably be moved to crawl_class(), so that a list of URLs can be easily crawled (e.g., for testing, one can enter a small list or even a single site in the testing code). This lets you skip the weird rotation logic, which we may also want to reconsider how we implement. At the least, we could rename the sets as monitored and non-monitored, since it's likely that other researchers would like to fetch many more copies of their monitored sites than non-monitored ones for an open-world setting, and there already is a tunable infrastructure in place for this anyway.

I'm creating a checklist that will be edited as I think of more ways to improve the composability of these scripts. At some point, I'll decide sufficient work has been done to this end and close this issue--leaving the exact endpoint up in the air for now.

  • Move Xvfb context management out of crawl_classes()
  • Move TB context management out of crawl_classes()

Make results of crawl immediately ready for analysis w/ go-knn

In 52a6ab4, I laid out a plan for the completion of work that needs to be done to finish the data collection mechanism such that data is immediately available for use by https://github.com/pylls/go-knn. The individual tasks to make this happen are as follows:

  • Define sd_sample_ratio in config.ini, and in the crawler do the following: (i) divide the not-sd class into sd_sample_ratio chunks and (ii) crawl the SD class once before crawling each chunk of the not-sd class. This follows the practice of other researchers, who collected many times more traces of the monitored class than of the non-monitored class. Since our monitored class consists of just SDs, where each site has very little variation, we really want to hone in on its specific variations; whereas the non-monitored (i.e., not SDs) class is so large that we're just getting a general idea of the non-SD hidden service space, and since it's a much larger class and widely varied in terms of features, collecting a significant amount of traces shouldn't greatly change the average of the features of this class.
  • Following Wang et al. (WPES 2013, Appendix C), after the initial onload event of each site we should (i) pause 5s to collect the traffic after the initial onload event, (ii) close the tab and start a new tab (or close TB if that's easier--depends on what capabilities tb-selenium has working reliably), and (iii) close all open streams using stem.
  • During this whole process we will have tor_cell_seq.log open read-only. Right before loading each site we will move our pointer in the file object up to the end of the file, and after step (iii) above we will read to the end of the log and copy over the data as described in 52a6ab4. We will not copy over tor cells that were part of client <-> IP or client <-> RP circuit construction to the <index>-<iteration> file; we will only copy over the tor cell sequence data that was sent to and from the HS over the client <-> RP circuit. To slightly elaborate on how <index> is derived, during initialization the crawler will store self.sds_ord = len(self.sds), the order of the SD class. The SD class will have indices 0, ..., self.sds_ord - 1 and the non-SD class will have indices self.sds_ord, self.sds_ord + 1, ... (or I may bump them up by one if go-knn expects indices to start at 1--TBD). The iteration should be pretty self-explanatory.

Database integration

Right now we're generating a lot of data that gets stored across many small files. This data situation is quickly going to become a mess, so we should get more organized by having our data collection code - the sorter/crawler - automatically upload its measurements into the relevant tables in a database each time it runs. Given the amount of data we have, PostgreSQL should suffice. I propose we have a separate schema raw that will store the raw training examples. Features derived from these raw measurements can be stored in a separate schema features, and results from our classifier experiments should be uploaded into another schema ml. Here's a proposed initial design for this first schema raw for the measurement task we are focused on currently, collecting data from HS frontpages:


The table frontpage_examples contains a row for every measurement we take of a given HS, with primary key exampleid. It links to the crawlers table, with primary key crawlerid, which describes information about the measurement conditions. The raw cell traces will be inserted into frontpage_traces and link back to frontpage_examples via exampleid. This structure enables us to very quickly select train/test sets in SQL with a couple of simple joins based on attributes we might be interested in: timestamp, url, crawler AS, sd_version, and so on.

Slightly rewrite sorter role to allow provision of a sorter-only system

Right now the sorter role depends on the crawler role for the tor binary, so while one could pass ANSIBLE_ARGS=--skip-tags=sorter and have just a crawler provisioned, if you ran ANSIBLE_ARGS=--skip-tags=crawler you'd be left with a sorter environment all configured except without Tor. The best solution seems to be for the sorter role to check whether there is a tor binary that has been compiled by the crawler role, and if not, to install the latest version of tor from the Tor deb repos (not the Debian repos, which are sometimes out of date). Installing from a package will be significantly faster than building Tor, and the modifications we make to relay.c are only needed by the crawler.

Investigate rebalancing the training set

We have a very imbalanced machine learning problem, where we have far fewer SecureDrop users than non-SecureDrop users. There are many ways of handling this situation - including oversampling the minority class or undersampling the majority class. Some of the techniques used for machine learning with very skewed classes are implemented in this library: https://github.com/scikit-learn-contrib/imbalanced-learn, so we could give some of these a try.

Update the class data pickle file

It's been a few weeks and SecureDrop has also received some version bumps, so the current class data pickle file contains out-of-date data. It should be updated since this project is ongoing and bundling it in this repo eases data collection when deploying to multiple VPS. Note: no potentially deanonymizing information is contained in the class data file; it is simply lists of sorted onion services.
