
argus's Introduction

YES/NO question answering

Run:

python web_interface.py

then open http://0.0.0.0:5500/ in a browser. Alternatively, for a version that also handles network timeouts well, run:

PYTHONIOENCODING=utf8 uwsgi --master --plugins python --http-socket "[::]:5500" -p 1 --manage-script-name --mount /=web_interface:app &
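The `--mount /=web_interface:app` argument tells uwsgi to look for a WSGI callable named `app` in the `web_interface` module. A minimal sketch of such a callable (hypothetical, for illustration only; the actual Argus frontend is more involved):

```python
# Minimal WSGI application sketch -- illustrates the `app` callable that
# uwsgi's `--mount /=web_interface:app` expects; NOT the actual Argus frontend.
def app(environ, start_response):
    body = b"YES/NO question answering"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]

# Exercise the callable directly with a stub environ, the way a WSGI
# server would, to check it produces a valid response.
captured = {}
def start_response(status, headers):
    captured["status"] = status
    captured["headers"] = headers

result = b"".join(app({"REQUEST_METHOD": "GET", "PATH_INFO": "/"},
                      start_response))
print(captured["status"], result.decode())
```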

Architecture

The web_interface module runs the web frontend as well as the basic search, analysis and scoring pipeline. However, the neural network processing of snippets and the scoring logic are part of the sentence pair scoring package

https://github.com/brmson/dataset-sts (f/bigvocab branch at the moment)

relying on its Argus dataset. You can clone that repo, run tools/hypev-api.py and modify the URL in argus/features.py.

Setup

python -m nltk.downloader maxent_treebank_pos_tagger
python -m nltk.downloader wordnet
pip install spacy
python -m spacy.en.download all

Testing

With mTurk output files present in tests/batches, running

python preprocess_output.py -regen

will create a bunch of output files in the tests/ folder that contain texts and feature values for all sources found.

Algorithm

Described in detail here: https://github.com/AugurProject/argus-paper

ElasticSearch

Install the ElasticSearch .deb from the website https://www.elastic.co/downloads/elasticsearch and the Python bindings using:

pip install elasticsearch

Start it up (by default, it runs on localhost:9200):

sudo /etc/init.d/elasticsearch start

To fill it with data, run (from the argus directory):

python fill_elastic.py [-G{path to folder with guardian jsons}] [-NY{path to folder with nytimes jsons} -RSS{path to root of rss folders}]
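The flags above might be handled with an argparse setup along these lines (a hypothetical sketch; the real fill_elastic.py may differ in defaults and help texts):

```python
import argparse

# Hypothetical sketch of the option handling behind fill_elastic.py;
# the flag names mirror the usage line above, everything else is assumed.
parser = argparse.ArgumentParser(
    description="Fill ElasticSearch with news articles")
parser.add_argument("-G", metavar="DIR", help="folder with Guardian JSONs")
parser.add_argument("-NY", metavar="DIR", help="folder with NYTimes JSONs")
parser.add_argument("-RSS", metavar="DIR", help="root of the RSS folders")

# Parse a sample command line instead of sys.argv, for illustration.
args = parser.parse_args(["-G", "data/guardian", "-RSS", "data/rss"])
print(args.G, args.NY, args.RSS)
```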

Note: to clear the database, run

curl -XDELETE localhost:9200/argus

Training

If you haven't already, run

python preprocess_output.py -regen

to create new output tsv files with up-to-date feature vectors. These vectors can then be used for the training process within the dataset-sts repo, with an example command in data/hypev/argus/README.md. Then, restart the hypev-api.

To reevaluate system performance with the retrained classifier, run preprocess_output.py.

If you want to train with some features turned off, open output.tsv and delete the classification or relevance symbol in the feature name.

Adding Features

  1. Each feature must inherit from the Feature object and must set its type (clas and/or rel) and value (name and info are also desirable). Look at the features already implemented in argus/features.py.
  2. To make the system use a new feature, add a string with the feature object name to the feature_list list AND to feature_list_official with its type symbols (you can change the name; only the types are important).
  3. Then run preprocess_output.py -regen to retrieve the feature, then train.
  4. To stop using a feature, simply erase it from feature_list and feature_list_official.

Currently used symbols: classification = '#', relevance = '@'
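A minimal sketch of what the steps above might look like in code (names and base-class layout are assumptions; the real Feature class lives in argus/features.py):

```python
# Hypothetical sketch of adding a feature; the real base class is in
# argus/features.py and may differ. Type symbols follow the convention
# above: '#' = classification, '@' = relevance.
class Feature:
    def __init__(self):
        self.type = ''       # '#', '@', or both
        self.value = 0.0
        self.name = ''       # desirable
        self.info = ''       # desirable

class SentimentFeature(Feature):
    """Toy example: a classification feature with a crude sentiment score."""
    def __init__(self, sentence):
        super().__init__()
        self.type = '#'      # classification feature
        self.name = 'Sentiment'
        self.value = 1.0 if 'win' in sentence.lower() else -1.0

# Register the feature by name AND with its type symbol (step 2 above).
feature_list = ['SentimentFeature']
feature_list_official = ['#Sentiment']

f = SentimentFeature("Will the Patriots win?")
print(f.type, f.value)
```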

Error analysis

After running batch_test, the system generates various error analysis files in tests/feature prints, most notably all_features.tsv, which contains the gold standard plus information about all features.

Data set

To generate the data set of question-label-sentence triplets (mainly for use in github/brmson/dataset-sts), run preprocess_output.py -regen and generate.py, located in tests/data_gen/ (argus_test[train].csv will be created in tests/data_gen).
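Generating such triplets boils down to writing one row per question-label-sentence combination; a hedged sketch (the column names and data here are illustrative assumptions, not the actual generate.py output):

```python
import csv
import io

# Hypothetical question-label-sentence triplets in the rough shape used
# for sentence-pair scoring; the real data comes from the mTurk batches.
triplets = [
    ("Will the Patriots win Super Bowl XLIX?", 1,
     "The Patriots defeated the Seahawks 28-24."),
    ("Will the Patriots win Super Bowl XLIX?", 0,
     "The Seahawks looked strong all season."),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["question", "label", "sentence"])  # assumed header
writer.writerows(triplets)
print(buf.getvalue())
```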

argus's People

Contributors

pasky, silvicek, tealmill


argus's Issues

Training / web inconsistency

There seems to be some inconsistency between web interface and the training process:

  • "Will the New England Patriots defeat the Seattle Seahawks in Super Bowl XLIX?" in tests/ftrain.tsv is marked YES 0.521451771259, with the "as it happened" article as the most relevant.
  • In the web interface right now, with the same model, it scores 0.08 for NO, and the ABC article comes first by relevancy, with the "as it happened" story second.

Odd commodity output for Aluminium

When using the benchtest script, the output for the test question regarding aluminium is odd; despite a wide date range being selected, both the min and max values are identical and refer to the same day. Is only a single data point being used to describe the entire range?

  • Implement proper search for the minimal/maximal day in the range or define special behaviour.
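The proper range search proposed above can be sketched as a scan over every day in the range instead of a single data point (the price series below is hypothetical):

```python
from datetime import date, timedelta

# Hypothetical per-day aluminium prices; in the real system these would
# come from the commodities backend.
prices = {
    date(2015, 6, 1) + timedelta(days=i): p
    for i, p in enumerate([1700.0, 1685.5, 1710.0, 1695.0, 1720.5])
}

def range_min_max(prices, start, end):
    """Scan every day in [start, end] rather than reading one data point."""
    days = [d for d in prices if start <= d <= end]
    return min(prices[d] for d in days), max(prices[d] for d in days)

lo, hi = range_min_max(prices, date(2015, 6, 1), date(2015, 6, 5))
print(lo, hi)
```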

Panoptes backends for finance domain

  • Stock market (using Yahoo API - NYSE, NASDAQ, LSE and more covered)
  • Commodities (oil, gold, silver, copper and more on the U.S. commodities market and exchange, see e.g. CNNMoney)
  • Currency exchange rates
  • e-currency prices

Date entry in web interface

For evaluation, we have generated event dates for the questions in our set (well, at least some of them; XXX), and whenever available, we use that date to limit the search to a 14-day period after it, to improve result relevance.

It'd be nice to have an optional date entry in the web interface too, since our original motivation is that this data is available in Augur anyway.
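The 14-day window described above amounts to a simple date filter; a minimal sketch with hypothetical article data:

```python
from datetime import date, timedelta

def within_window(article_date, event_date, days=14):
    """Keep only articles published within `days` days after the event."""
    return event_date <= article_date <= event_date + timedelta(days=days)

# Hypothetical event date and candidate articles.
event = date(2015, 2, 1)
articles = [
    ("pre-game preview", date(2015, 1, 30)),
    ("as it happened", date(2015, 2, 2)),
    ("season retrospective", date(2015, 4, 1)),
]
relevant = [title for title, d in articles if within_window(d, event)]
print(relevant)
```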

Panoptes backend for weather

Lower priority.

  • Investigate Wolfram Alpha
  • Investigate forecast.io
  • Investigate Yahoo Weather API
  • Look for more?

Data types:

  • Temperature

Clean up train/test splits

We should do a train/test split on input data before processing anything, not just when reporting the results, to ensure that proper data hygiene is kept.

(At a later time, we should also further split the train to trainmodel and val and perform learning of sub-classifiers like the sentiment on trainmodel and measure its performance on val rather than test, so that we don't overfit by parameter tuning. However, we have too little data to afford that at this point, so it's just something to bear in mind for now.)
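The up-front split proposed above can be sketched as a deterministic shuffle-and-cut applied once, before any feature extraction or training touches the data (the seed and fractions are illustrative):

```python
import random

def split(items, test_frac=0.2, seed=17):
    """Shuffle deterministically, then split once, up front -- before any
    feature extraction or classifier training sees the data."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * (1 - test_frac))
    return items[:cut], items[cut:]

questions = [f"q{i}" for i in range(10)]
train, test = split(questions)
print(len(train), len(test))
```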

Panoptes backend for sports domain

This has a lot of open questions, e.g. how to precisely specify the match / race.

Sources:

  • Investigate Yahoo Fantasy - bad, see wiki
  • Investigate scraping NBCSports - against ToS
  • Investigate Wolfram Alpha
  • Investigate http://www.cricapi.com/ for Cricket, specifically

Baseline choice (NBA NFL NHL): https://www.stattleship.com/

Domains:

  • soccer
  • tennis
  • cricket
  • basketball
  • baseball
  • football
  • golf
  • boxing
  • nascar
  • horse racing
