
Forked from ahounsel/disinfo-infra-public.


Identifying Disinformation Websites Using Infrastructure Features

Source code and training data for the academic paper available here.

Structure

  • bin - Contains all entry points for the system

  • system - All source code for the system, including fetching and classifying data

  • analysis - All code to analyze classification performance

Installation

Steps to Develop on disinfo-infra

  1. Create a Python virtual environment (optional) to avoid conflicts with local packages: python3 -m venv name-of-environment
  2. Activate the virtual environment to install packages in isolation: source name-of-environment/bin/activate
  3. Install the required dependencies for disinfo-infra via pip: pip install -r requirements.txt
  4. Set up the development install via setup.py: python setup.py develop. This allows changes to be tracked while developing, without re-installation on each change.
  5. Develop!
  6. To deactivate the virtual environment: deactivate

Steps to install and run the code (NOT FOR DEVELOPMENT)

  1. Download source
  2. pip install -r requirements.txt
  3. python setup.py install

Entry Points

disinfo_net_data_fetch.py - continually fetches new domains and raw data for those domains. Implemented domain pipes include Reddit, Twitter, Certstream, and DomainTools.

disinfo_net_train_classifier.py - trains the classifier from designated training data.

disinfo_net_classify.py - classifies raw data fetched by disinfo_net_data_fetch.py. It extracts features, classifies websites, and inserts the results into a database table named by the user. It can be run in "live" mode, where it continuously classifies new domains as they are fetched by disinfo_net_data_fetch.py, or it can classify an entire database of candidate domains at once.

System Structure

  1. Orchestrate - contains a conductor class that handles thread creation for domain pipes and worker threads; a worker thread class that fetches raw data for a domain; and a classification thread class that extracts features from raw data and classifies the corresponding domains.

  2. Classify - a Classifier class with classes and functions for training, extracting features, and classifying candidate domains.

  3. Features - classes to both fetch raw data and extract features from that raw data.

  4. Pipe - contains an abstract base class for domain pipes that defines a standard interface for what the system expects when a domain is processed. Current implementations of this ABC include the Reddit, Twitter, Certstream, and DomainTools domain pipes.

  5. Postgres - classes to interact with a Postgres database, including inserting, checking, and retrieving data.

  6. util - various utility classes, including classes to unshorten URLs, get TLDs, and determine ownership of IP addresses.
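The domain-pipe interface described above can be sketched as follows. This is an illustrative example only: the class name DomainPipe, the method get_domains, and the RedditPipe implementation are hypothetical stand-ins, not the actual names used in system/pipe.

```python
from abc import ABC, abstractmethod

class DomainPipe(ABC):
    """Illustrative standard interface for a domain source (hypothetical names)."""

    @abstractmethod
    def get_domains(self):
        """Yield (domain, post_id, platform) tuples from the source."""

class RedditPipe(DomainPipe):
    # A real pipe would poll the Reddit API; here we iterate over canned posts.
    def __init__(self, posts):
        self.posts = posts

    def get_domains(self):
        for post in self.posts:
            yield (post["domain"], post["id"], "reddit")

pipe = RedditPipe([{"domain": "example.com", "id": "abc123"}])
print(list(pipe.get_domains()))
```

Because every pipe yields the same tuple shape, the conductor can spawn one worker thread per pipe without caring which platform the domains came from.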

Database Entries

Our system works in two parts. The first is a data-fetching script, which inserts raw data into a database table structured as follows:

 Attribute (Type) {
 domain (Text) (Primary Key),
 certificate (Text),
 whois (Text),
 html (Text),
 dns (Text),
 post_id (Text),
 platform (Text),
 insertion_time (UTC)
}

Where each attribute is:

  • domain - the unique domain with which the rest of the data is associated
  • certificate - the certificate of the domain, in raw string format
  • whois - the whois response, in raw string format
  • html - the raw HTML source of the domain's homepage
  • dns - the IP address(es) that the domain was found to map to
  • post_id - the unique id of the post on the given platform (for verification purposes)
  • platform - the platform on which the domain was posted
  • insertion_time - the time of insertion into the database
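The raw-data schema above can be sketched concretely as follows. This is an illustrative example using Python's built-in sqlite3 module in place of the project's Postgres database; column names follow the table description, but types and values are stand-ins.

```python
import sqlite3

# In-memory SQLite stand-in for the project's Postgres raw-data table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE raw_data (
        domain TEXT PRIMARY KEY,
        certificate TEXT,
        whois TEXT,
        html TEXT,
        dns TEXT,
        post_id TEXT,
        platform TEXT,
        insertion_time TEXT
    )
""")
conn.execute(
    "INSERT INTO raw_data VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("example.com", "<cert>", "<whois>", "<html>", "93.184.216.34",
     "abc123", "reddit", "2020-01-01T00:00:00Z"),
)
row = conn.execute("SELECT domain, platform FROM raw_data").fetchone()
print(row)  # ('example.com', 'reddit')
```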

The second part of our system, which classifies a domain given raw data about it, inserts those classifications into a database table structured as follows:

 Attribute (Type) {
 domain (Text) (Primary Key),
 classification (Text) (one of: unclassified, non_news, disinformation),
 probabilities (JSON),
 insertion_time (UTC)
}

Where each attribute is:

  • domain - unique domain which the rest of the data is associated with
  • classification - the actual classification of the domain by the classifier, one of: news, non_news, disinformation
  • probabilities - the probabilities of each class mentioned above in a JSON dictionary format
  • insertion_time - time of insertion to the database
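Inserting one classification row might look like the sketch below, again using sqlite3 as an illustrative stand-in for Postgres and storing the per-class probabilities as a serialized JSON dictionary. The specific probability values are invented for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE classifications (
        domain TEXT PRIMARY KEY,
        classification TEXT,
        probabilities TEXT,  -- JSON dict serialized to text
        insertion_time TEXT
    )
""")

# Hypothetical classifier output: probability per class.
probs = {"news": 0.10, "non_news": 0.15, "disinformation": 0.75}
conn.execute(
    "INSERT INTO classifications VALUES (?, ?, ?, ?)",
    ("example.com", max(probs, key=probs.get), json.dumps(probs),
     "2020-01-01T00:00:00Z"),
)
label, raw = conn.execute(
    "SELECT classification, probabilities FROM classifications"
).fetchone()
print(label, json.loads(raw)[label])  # disinformation 0.75
```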

Finally, we provide a prepopulated training database containing the raw data for all of our training domains, in the format:

 Attribute (Type) {
 domain (Text) (Primary Key),
 target (Text) (one of: unclassified, non_news, disinformation),
 certificate (Text),
 whois (Text),
 html (Text),
 dns (Text)
}

Where each attribute is the same as our raw data table, with target being the known label for the domain.
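A training script could consume this table by splitting each row into its raw-data record and its known label. The sketch below is a hypothetical illustration using an in-memory sqlite3 table with two invented training rows; it is not the project's actual loading code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE training (
        domain TEXT PRIMARY KEY,
        target TEXT,
        certificate TEXT, whois TEXT, html TEXT, dns TEXT
    )
""")
conn.executemany(
    "INSERT INTO training VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("legit-news.example", "non_news", "c1", "w1", "h1", "1.2.3.4"),
        ("fake.example", "disinformation", "c2", "w2", "h2", "5.6.7.8"),
    ],
)

# Split rows into raw-data records (for feature extraction) and labels.
records, labels = [], []
for domain, target, *raw in conn.execute("SELECT * FROM training ORDER BY domain"):
    records.append((domain, raw))
    labels.append(target)
print(labels)  # ['disinformation', 'non_news']
```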

Using the Chrome extension

Navigate your Chrome browser to chrome://extensions and enable developer mode in the top right.

Click "Load Unpacked" and select the src/plugin/ directory.

Chrome developer tutorial

Navigate to any of the sites listed in src/plugins/classified_sites.txt (for example, needtoknow.news) and you will see the warning message.
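The check the extension performs can be sketched as matching the current page's hostname against the classified-sites list. The extension itself is JavaScript, but the logic is shown here in Python for illustration; the set of domains is a stand-in for the contents of classified_sites.txt, which is assumed to hold one domain per line.

```python
from urllib.parse import urlparse

# Stand-in for the contents of classified_sites.txt (one domain per line).
classified_sites = {"needtoknow.news"}

def should_warn(url):
    """Return True if the URL's host is a classified domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in classified_sites)

print(should_warn("https://needtoknow.news/article"))  # True
print(should_warn("https://example.com/"))             # False
```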

Contributors

  • ahounsel
