
mindbender's Introduction

Mindbender

Mindbender is a set of tools for iterative knowledge base construction with DeepDive.

Synopsis

Installation

  1. Download a release of Mindbender.
  2. Mark the downloaded file as executable (by running chmod +x mindbender-*.sh).
  3. Place it in a directory on your $PATH (e.g., /usr/local/bin/mindbender), renaming it so you can simply type mindbender later; see the example after this list.
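
For example, assuming the downloaded release file is named mindbender-LATEST.sh (the actual file name varies by release):

chmod +x mindbender-LATEST.sh
sudo mv mindbender-LATEST.sh /usr/local/bin/mindbender   # any directory on $PATH works
which mindbender                                         # confirm the command is now found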

Alternatively, you can build and install from source by running make install PREFIX=/usr/local.

Latest Example

See examples/spouse_example for more details about using the tools included in Mindbender.


Launch Mindtagger for labeling data

mindbender tagger examples/labeling/**/mindtagger.conf
# See also: ./examples/labeling/start-mindtagger.sh

Take snapshots of your DeepDive app, producing various reports

cd your-deepdive-app
mindbender snapshot
open snapshot/LATEST/README.md

Launch Dashboard to use the reports interactively for deeper error analysis

cd your-deepdive-app
mindbender dashboard

There are some examples included in this source tree:

cd examples/dashboard/spouse_example
mindbender dashboard

Mindtagger

Mindtagger is an interactive data labeling tool. Please refer to the DeepDive documentation for details on how to use Mindtagger to estimate the precision of DeepDive apps. For marking up text documents in general, e.g., for recall estimation, see the example tasks genomics-recall and genomics-recall-relation in the source tree for now. They can be launched using the following script:

./examples/labeling/start-mindtagger.sh

mindbender's People

Contributors

alldefector, netj, oxymor0n, thomaspalomares, wonyeol


mindbender's Issues

MT recall mode: initialize labels from extractions & automatic P/R analysis

It would be very useful if one could dump initial labels from the database and later only need to modify existing labels. Currently we can dump extractions and highlight them as a hint, but it would be nicer to use these extractions to generate initial tags directly.

It would also be nice to compute the diff with the original tags, to understand the precision/recall errors the system is making. For example, all PERSON tags that exist in the manual labels but not in the extractions are recall errors, and all PERSON tags that exist in the extractions but not in the manual labels are precision errors. Ultimately there could be a panel showing the P/R for all relations.

A cleaner design might be to let users specify one or more extraction files, say extractions.csv. Before the first labeling pass, one could click a button to initialize all manual tags from the extractions. These extraction files are dumped after each run of DeepDive, so one can compute the P/R of the current system from an automatic comparison against the labeled tags.json. A P/R panel could then be added to the interface.
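
A minimal sketch of the proposed automatic comparison, assuming the extractions and the manual labels have each been exported as sorted doc_id<TAB>tag lines (extracted.tsv and manual.tsv are hypothetical file names, not files Mindbender produces):

tp=$(comm -12 extracted.tsv manual.tsv | wc -l)   # tags in both: true positives
fp=$(comm -23 extracted.tsv manual.tsv | wc -l)   # extracted but not labeled: precision errors
fn=$(comm -13 extracted.tsv manual.tsv | wc -l)   # labeled but not extracted: recall errors
echo "precision: $(echo "scale=3; $tp / ($tp + $fp)" | bc)"
echo "recall:    $(echo "scale=3; $tp / ($tp + $fn)" | bc)"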

Faceted Search in MBS: Support unified ES doc type so we can have faceted search

Currently MBS creates a separate ES doc type (equivalent to a DB table) for each source or extraction relation. Only one doc type can be searched / rendered at a time. There is no way for the user to perform the typical faceted search because each extraction relation is typically one facet (e.g., age, name, city). Facets do not even propagate across child-parent / FK links.

One possible workaround is to add a post-processing step in DD that joins the source table with all extraction tables (say, using array_agg(extraction_value) grouped by doc_id), and then define all the facets on this unified table; a sketch follows.
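
A rough sketch of that post-processing join, with hypothetical relation names (articles as the source relation, person_age and person_city as extraction relations keyed by doc_id); the actual names come from your app's schema:

psql "$DBNAME" <<'SQL'
-- denormalize source + extractions so all facets live in one table / ES doc type
CREATE TABLE search_unified AS
SELECT a.doc_id,
       a.content,
       array_agg(DISTINCT pa.age)  AS ages,    -- facet: age
       array_agg(DISTINCT pc.city) AS cities   -- facet: city
  FROM articles a
  LEFT JOIN person_age  pa ON pa.doc_id = a.doc_id
  LEFT JOIN person_city pc ON pc.doc_id = a.doc_id
 GROUP BY a.doc_id, a.content;
SQL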

Alternatively, we could add annotation support for the above use case. For example, @reference_inline would let the parent relation absorb all the navigable/searchable fields of the child relation. That would make ES mapping generation a bit less straightforward. For index creation, we could either do the join in SQL and populate ES in one pass or perform multi-pass ES updates (one source/extraction table per pass).

@netj @chrismre

Source Directory Must Be Lowercase

When running Mindbender after running my DeepDive project, I got this error:

lh@dd:~/repos/HardwareDeepDive$ mindbender search update
Launching Elasticsearch for http://localhost:9200 from /home/lh/repos/HardwareDeepDive/search
Deriving mappings for relations table_dump part_num_mentions part_stg_temp_min_label gold_data part_stg_temp_min
Creating index HardwareDeepDive
Elasticsearch error: InvalidIndexNameException[[HardwareDeepDive] Invalid index name [HardwareDeepDive], must be lowercase]

Simply renaming my ~/repos/HardwareDeepDive directory to ~/repos/hardwaredeepdive seems to have solved the problem.
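
In other words, since the index name is derived from the app directory name and Elasticsearch only accepts lowercase index names, renaming the directory is a sufficient workaround:

mv ~/repos/HardwareDeepDive ~/repos/hardwaredeepdive
cd ~/repos/hardwaredeepdive
mindbender search update    # the derived index name is now lowercase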

invalid mindbender COMMAND

I am using the latest version of DeepDive. I tried the following commands from the README.md, but got these errors:
$ mindbender search update

mindbender shell
Usage: mindbender [-OPTION] COMMAND [ARG]...
Global OPTION is one of:
  -v  increase verbosity
  -q  suppress all messages
  -t  force logging to non-ttys
      (default is to log messages to stderr only when it's a tty)
search: invalid COMMAND

$ mindbender search gui

(same usage output and "search: invalid COMMAND" error as above)

Am I doing something wrong?

How to change the ip:port config to run mindbender on a server?

I installed DeepDive on an Ubuntu server without a GUI. The command "mindbender search gui" runs at localhost:8000, and Elasticsearch also runs on localhost. How do I change the config files so the services can be reached at an ip:port like 194.1.168.3:8000?
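
One workaround that does not require changing any Mindbender or Elasticsearch config (this is a plain SSH tunnel, not a documented Mindbender option) is to forward the GUI port from the server to your local machine:

# run on your local machine; 194.1.168.3 is the server from the question above
ssh -L 8000:localhost:8000 your-user@194.1.168.3
# then open http://localhost:8000 locally while mindbender runs on the server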

Exported Insert.sql column type error

For the generated insert.sql file, the column type for labels should be BOOLEAN, but it is generated as TEXT. This causes an error on Greenplum.

$ psql $DBNAME < insert.sql
NOTICE:  table "tags_precision_m_claim_type_v1" does not exist, skipping
DROP TABLE
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'doc_id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
ERROR:  column "is_correct" is of type text but expression is of type boolean
LINE 2: ("doc_id", "mention_id", "is_correct", ...
                                 ^
HINT:  You will need to rewrite or cast the expression.
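
One possible workaround until the export is fixed, assuming the generated DDL declares the column roughly as "is_correct" text (the exact spelling may differ in your file), is to patch the type before loading:

# rewrite the label column type in the generated file, then load it into Greenplum
sed -i 's/"is_correct" text/"is_correct" boolean/' insert.sql
psql "$DBNAME" < insert.sql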

Automatic tags.json backup

We should be able to regularly back up tags.json. This file is so important that we should back it up every hour, every 10 modifications, or so.
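
Until that is built in, a crude external workaround is a loop (or cron job) that keeps timestamped copies on a schedule; the path below is hypothetical:

# keep an hourly, timestamped copy of tags.json next to the labeling task
while true; do
  cp labeling/my-task/tags.json "labeling/my-task/tags.json.$(date +%Y%m%d%H%M)"
  sleep 3600
done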

tags.json permission problems

We had a use case where user A started Mindtagger for labeling on a server S, user C visited the labeling interface, and tags.json was generated.

At time T, user B logged into the same server S and started Mindtagger in the same folder. However, user B did not have permission to change the tags.json file, so when C continued labeling, the changes could not be saved to tags.json; worse, C could not tell from the frontend that anything was wrong. The tags.json file stayed as it was at time T, since no further changes could be saved.

Possible solutions might be to force tags.json to be writable by everyone, or to check for permission problems and report them (to the frontend?).
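
A server-side stopgap, assuming all labelers share a Unix group (labelers is a hypothetical group name), is to make the file group-writable before handing the task over:

chgrp labelers tags.json   # shared group for everyone who runs mindtagger in this folder
chmod g+w tags.json        # let other group members save label changes
ls -l tags.json            # confirm the permissions before restarting mindtagger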

Small UI issue

With the current version of Mindtagger, the line height changes after a label is created, and if the mouse hovers over the mention, the line height changes quite a bit:

[screenshot omitted]

Though it's a minor issue, I think this somehow affects the user experience. =)

is_correct counts disappear after restarting mindtagger

I suppose Mindtagger reloads the annotations from tags.json every time it restarts. The is_correct annotations are still there, but the distribution of is_correct values is missing. Please see the attached screenshot. -Xiao

[screenshot omitted]

MT Shortkey

An idea for shortcut keys: one could allow multiple labels to be associated with one shortcut key, switching between them when the key is pressed multiple times.

e.g., address and amount both have the shortcut key a, and
a pressed once: address is selected.
a pressed twice: amount is selected.

MT Recall Mode: Label for relations

MT recall mode is useful for labeling mentions, but painful for labeling relations. As I mentioned in a separate email, to label all relations in a document I currently need to:

  1. hack the display of the mention_id in the bottom tags navbar
  2. make a selection of mention 1, then copy the mention_id into a TSV file
  3. make a selection of mention 2, then copy the mention_id into the same line of the TSV file

It would be great to support recall labeling natively in the interface. It would be nice simply to be able to:

  • select the two mentions in the interface, then click a button, and the two mentions are labeled as a relation.

or alternatively:

  • Select mention 1, click "add to relation"
  • select mention 2, click "add to relation"
  • then click "finish labeling a relation" and they are recorded.

With the latter approach, one could even label a relation across different pages.

As an idea for visualizing relations, mentions in a relation could be marked up with the same number in the top-right corner.

@netj @chrismre

Adapt Mindtagger Instance for Genepheno Precision alone

Copy labeling/templates/genepheno-holdout or adapt labeling/templates/genepheno-precision in order to label only sentences with expectation > 0.9, for a better precision estimate. You should modify at least input.sql, or (to add features) also template.html; a sketch of the input.sql change follows.
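
A minimal sketch of such an input.sql change, assuming the inference results live in a relation like genepheno_causation_is_correct_inference with DeepDive's usual expectation column (the actual relation name depends on the app):

# restrict the Mindtagger input to high-confidence candidates only
cat > input.sql <<'SQL'
SELECT *
  FROM genepheno_causation_is_correct_inference   -- hypothetical; use your app's inference relation
 WHERE expectation > 0.9
SQL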

How to trace source sentence with multiple keys?

I want to trace the source sentence for my extraction based on multiple keys, so I wrote something like:

@source
sentences(
    @key
    doc_id text,
    @key
    section_id text,
    ref_doc_id text,
    @key
    sent_id int,
    @searchable
    words text[],
    lemmas text[],
    poses text[],
    ners text[],
    dep_paths text[],
    dep_parents int[]).

@extraction
gene_mentions(
    id bigint,
    @references(relation="sentences", column="doc_id", alias="sent_gene")
    doc_id text,
    @references(relation="sentences", column="section_id", alias="sent_gene")
    section_id text,
    @references(relation="sentences", column="sent_id", alias="sent_gene")
    sent_id int,
    wordidxs int[],
    @key
    mention_id text,
    supertype text,
    subtype text,
    @searchable
    entity text,
    @searchable
    words text[],
    @navigable
    is_correct boolean).

But it doesn't work. Am I missing something? How can I use multiple keys to find the source sentence?

[Dashboard] Formatted reports include html

At least with many of the sample reports in the stanford-memex data, the API call to GET api/snapshot/snapshotId/reportId includes an "html" key for formatted reports. This makes it difficult to distinguish formatted reports from custom reports, as this determination is based on whether the "html" key is present in the report JSON.

Example:
/api/snapshot/20150410-6/variable/candidate/sample-frequent-candidates%20rates.is_correct

The resulting JSON includes a "chart" key under the first data key, suggesting that it is a formatted report, but also includes an "html" key.

Throw meaningful errors when port is taken

Currently, when the port is taken, it shows an error like:

Starting Mindtagger for all tasks under /lfs/local/0/senwu/labeling/labeling2/...
Parsing task configuration: location_mentions_precision/mindtagger.conf

events.js:72
        throw er; // Unhandled 'error' event
              ^
Error: listen EACCES
  at errnoException (net.js:904:11)
  at Server._listen2 (net.js:1023:19)
  at listen (net.js:1064:10)
  at Server.listen (net.js:1138:5)
  at Object.<anonymous> (/tmp/mindbender-senwu/9a0f9034fadb8b42437951c90241a43febf3af65/gui/server.coffee:51:8)
  at Object.<anonymous> (/tmp/mindbender-senwu/9a0f9034fadb8b42437951c90241a43febf3af65/gui/server.coffee:2:1)
  at Module._compile (module.js:456:26)

The error message could be more intuitive.
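
Until the error is made friendlier, a quick manual check for whether another process is already bound to the GUI port (assuming the default port 8000; adjust to your setup):

lsof -i :8000                  # show the process currently listening on the port
# or, if lsof is unavailable:
netstat -tlnp | grep ':8000'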

Master branch doesn't build, but there's an easy fix

When you build master and launch the application, you get a JavaScript error in the browser. The problem is that some CoffeeScript gets compiled into incorrect JavaScript:

angular.module('mindbender.dashboard', ['ui.ace', 'ui.bootstrap', 'ui.sortable'].service('Dashboard', ...

Note that a closing parenthesis is missing after 'ui.sortable']. This is because the CoffeeScript compiler used is version 1.6.3, which doesn't correctly handle this AngularJS code.

The correct version, 1.8.0, is specified in gui/frontend/bower.json, but it is never used. Instead, an npm package (karma) forces installation of CoffeeScript 1.6.3 into gui/frontend/node_modules, which is then used in the build.

There's an easy fix: add "coffee-script": "1.8.0" to the devDependencies section of gui/frontend/package.json.

(This also fixes a second error: the older CoffeeScript didn't support the power operator, e.g., 3 ** 3.)
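
One way to apply the suggested fix from the command line, which adds the pinned version to devDependencies in gui/frontend/package.json:

cd gui/frontend
npm install --save-dev coffee-script@1.8.0   # pin the CoffeeScript version used by the build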

mindtagger search/filter function

I'm using the latest version of Mindbender. I wonder if Mindtagger has a keyword search that would let me quickly locate a prediction instance I labeled. I'd also like to filter prediction instances by their tag values (i.e., see only the examples labeled with a particular tag value). I'm new to the tool, so if these functions already exist, please point them out. I tried the search box at the top of the page and it didn't work; neither did clicking the tag values. -Xiao

MT Recall mode: Pagination / Render by document

In Mindtagger recall mode, rendering by document is not natively supported. All the sentences are rendered in a flat manner, with no clear boundary between documents. One can hack around this by manipulating the input data and template, but it would be nice to support it natively.

@netj @chrismre
