teamhg-memex / sitehound-frontend

Site Hound (previously THH) is a Domain Discovery Tool

License: Apache License 2.0

Python 19.89% CSS 4.72% JavaScript 27.84% HTML 46.73% Shell 0.11% Roff 0.71%
Topics: domain-discovery

sitehound-frontend's Introduction

Site Hound

Site Hound (previously THH) is a Domain Discovery Tool that extends the capabilities of commercial search engines using automation and human-in-the-loop (HITL) machine learning, allowing the user to efficiently expand the set of relevant web pages within their domain(s) or topic(s) of interest.
Site Hound is the UI to a more complex set of tools described below. It was developed under the Memex Program by HyperionGray LLC in partnership with Scrapinghub, Ltd. (2015-2017).

Main Features

  1. Role Based Access Control (RBAC).
  2. Multiple workspaces for keeping things tidy.
  3. Input of keywords, to be included in or excluded from searches.
  4. Input of seed URLs, an initial list of websites that you already know are on-topic.
  5. Expands the list of sites by querying the keywords on multiple commercial search engines.
  6. Displays screenshots (powered by Splash), title, text, HTML, and relevant terms in the text.
  7. Allows the user to iteratively train a topic model based on these results by assigning them one of the defined values (Relevant/Irrelevant/Neutral), as well as re-scoring the associated keywords.
  8. Allows an unbounded training module based on user-defined categories.
  9. Language detection (powered by Apache Tika) and page-type classification (powered by HG's https://github.com/TeamHG-Memex/thh-classifiers).
  10. Allows the user to inspect the trained topic model through a human-interpretable explanation of the model, powered by our machine learning explanation toolkit https://github.com/TeamHG-Memex/eli5.
  11. Performs a broad crawl of thousands of sites, using Machine Learning (provided by https://github.com/TeamHG-Memex/hh-deep-deep) to filter the ones matching the defined domain.
  12. Displays the results in an interface similar to Pinterest for easy scrolling of the findings.
  13. Provides summarized data about the broad crawl and export of the broad-crawl results in CSV format.
  14. Provides real-time information about the progress of the crawlers.
  15. Allows search of the dark web via integration with an onion index.

Infrastructure Components

When the app starts up, it first tries to connect to all of these components:

  • Mongo (>3.0.*) stores the data about users, workspaces, and metadata about the crawls.
  • Elasticsearch (2.0) stores the results of the crawls (screenshots, HTML, extracted text).
  • Kafka (8.*) handles the communication between the backend components regarding the crawls.

Custom Docker images of these components, with the extra arguments needed to set up the stack correctly, are provided in the Containers section below.
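Before starting the app, it can be handy to confirm that all three services are reachable from the host. The snippet below is a minimal sketch using the pymongo, elasticsearch and kafka-python client libraries with the default ports from the Containers section; it is illustrative only and not part of the app itself.

# check_services.py -- illustrative reachability check, not part of Site Hound
from pymongo import MongoClient
from elasticsearch import Elasticsearch
from kafka import KafkaConsumer

# MongoDB: a ping against the admin database fails fast if the server is down
mongo = MongoClient("mongodb://127.0.0.1:27017", serverSelectionTimeoutMS=2000)
mongo.admin.command("ping")
print("mongo ok")

# Elasticsearch: ping() returns True when the cluster answers
es = Elasticsearch([{"host": "127.0.0.1", "port": 9200}])
print("elasticsearch ok" if es.ping() else "elasticsearch DOWN")

# Kafka: constructing a consumer raises an error if no broker is reachable
consumer = KafkaConsumer(bootstrap_servers="127.0.0.1:9092")
print("kafka ok, topics: %s" % sorted(consumer.topics()))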

Service Components:

These components offer a suite of capabilities to Site Hound. Only the first three are required.

  • Sitehound-Backend: Performs queries on the search engines, follows the relevant links, and orchestrates the screenshots, text extraction, language identification, page classification, naive scoring using the cosine similarity of TF*IDF vectors (see the sketch after this list), and storage of the result sets.
  • Splash: Splash is used for screenshot and HTML capturing.
  • HH-DeepDeep: Allows the user to train a page model to perform on-topic crawls.
  • THH-classifier: Classifies pages according to their type (e.g. forums, blogs, etc.).
  • Dark Web index: This is currently a private db. Ask us about it.
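To picture the naive scoring mentioned for Sitehound-Backend: each fetched page can be scored by the cosine similarity between its TF*IDF vector and the vector built from the workspace keywords. The sketch below uses scikit-learn and made-up example text; the actual implementation lives in Sitehound-Backend and may differ.

# tfidf_score.py -- illustrative sketch of TF*IDF cosine scoring (example data, not the backend code)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

keywords = "excavator rental heavy machinery"             # workspace keywords (example)
pages = [
    "Excavator and bulldozer rental for construction sites",
    "Best chocolate chip cookie recipes",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([keywords] + pages)     # row 0 is the keyword vector
scores = cosine_similarity(matrix[0], matrix[1:])[0]      # one score per page
for page, score in zip(pages, scores):
    print("%.3f  %s" % (score, page))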

Here is the components diagram for reference: Components Diagram

Containers

Containers are stored in HyperionGray's Docker Hub.

Mongodb

Define a folder for the data:

sudo mkdir -p /data/db

and run the container:

docker run -d -p 127.0.0.1:27017:27017 --name=mongodb --hostname=mongodb -v /data/db:/data/db hyperiongray/mongodb:1.0
Kafka
docker run -d -p 127.0.0.1:9092:9092 -p 127.0.0.1:2181:2181 --name kafka-2.11-0.10.1.1-2.4 --hostname=hh-kafka hyperiongray/kafka-2.11-0.10.1.1:2.4

Wait about 10 seconds for the service to fully start and be ready for connections.
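Instead of a fixed sleep, you can poll the broker until it accepts connections. A minimal sketch with the kafka-python client, assuming the port mapping above:

# wait_for_kafka.py -- illustrative readiness poll instead of a fixed sleep
import time
from kafka import KafkaConsumer
from kafka.errors import NoBrokersAvailable

for attempt in range(30):
    try:
        KafkaConsumer(bootstrap_servers="127.0.0.1:9092")
        print("kafka is ready")
        break
    except NoBrokersAvailable:
        time.sleep(1)                                      # broker not up yet, retry
else:
    raise SystemExit("kafka did not come up within 30 seconds")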

Elasticsearch
docker run -d -p 127.0.0.1:9200:9200 -p 127.0.0.1:9300:9300 --name=elasticsearch --hostname=elasticsearch elasticsearch:2.0

Lastly, check the HH-DeepDeep installation notes for running it with Docker.

Configuration

Properties are defined in /ui/settings.py
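The exact property names depend on the version you are running, so check /ui/settings.py itself; the entries below are hypothetical examples of the kind of values that point the app at the containers from the section above.

# hypothetical settings.py entries -- the real property names may differ
MONGODB_URI = "mongodb://127.0.0.1:27017"        # hypothetical name: MongoDB connection string
ELASTICSEARCH_HOST = "127.0.0.1:9200"            # hypothetical name: Elasticsearch endpoint
KAFKA_HOST = "127.0.0.1:9092"                    # hypothetical name: Kafka bootstrap server
APP_PORT = 5081                                  # hypothetical name: port the Flask app listens on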

Installation

The app runs on Python 2.7 and the dependencies can be installed with pip:

pip install -r requirements.txt

then start up the Flask server:

python runserver.py

The app should be listening on http://localhost:5081 with the admin credentials: [email protected] / changeme!
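A quick way to confirm the server is up is to request the root URL, for example with Python 2.7's standard library (any HTTP client works just as well):

# smoke test (Python 2.7): prints the HTTP status code returned by the running app
import urllib2
print(urllib2.urlopen("http://localhost:5081").getcode())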

Dockerized version of Sitehound

Alternatively, a container can be run instead of the local installation

sitehound_version="3.3.2"
docker run -d -p 0.0.0.0:5081:5081 --name=sitehound-$sitehound_version --hostname=sitehound --link mongodb:mongodb --link kafka-2.11-0.10.1.1-2.4:hh-kafka --link elasticsearch:hh-elasticsearch hyperiongray/sitehound:$sitehound_version


sitehound-frontend's People

Contributors

atowler, ctwardy, fornarat, lopuhin, mehaase


sitehound-frontend's Issues

Add Known URLs: allow incoming to be "Neutral" (vs. default "Relevant")

[Discussed on Slack #pagetype ~3 weeks ago. I had accidentally posted this to thh-classifiers.]

In "Add Known URLs", the user can supply a line-separated list of ostensibly-known URLs. Currently they come in as "Relevant". I'd like these pages to come in tagged as "Neutral", and then review.

In my use case, I am using SiteHound to find the relevant ones. I supply hundreds of likely, but unverified URLs from past crawls. Many were once relevant, but are now 404 or expired domains. Some were simply false positives. Starting "Neutral" makes it easy to find the pages not yet sorted into "Relevant" and "Irrelevant".

That will be especially important if I later add a new batch of pages to review: coming in "Neutral" allows users to easily find and tag them.

(Similarly the option could be extended to user-defined labels, though in my case coming in unlabeled is just right.)

Request: allow rename workspace

I created a workspace called "pagetype test" but it's basically just forum pages and I'd like to rename it accordingly. It would be nice to have a "rename" feature on the workspace dashboard or just after entering a workspace.
