
mindbender's Introduction

Mindbender

Mindbender is a set of tools for iterative knowledge base construction with DeepDive.

Synopsis

Installation

  1. Download a release of Mindbender.
  2. Mark the downloaded file as executable (by running chmod +x mindbender-*.sh).
  3. Place it in a directory on your $PATH (e.g., /usr/local/bin/mindbender), renaming it so you can simply type mindbender later; see the example after this list.
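
For example, assuming the downloaded release file is named mindbender-LATEST.sh (the actual file name varies by release):

chmod +x mindbender-LATEST.sh
sudo mv mindbender-LATEST.sh /usr/local/bin/mindbender   # any directory on $PATH works
which mindbender                                         # confirm the command is now found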

Alternatively, you can build and install from source by running make install PREFIX=/usr/local.

Latest Example

See examples/spouse_example for more details about using the tools included in Mindbender.


Launch Mindtagger for labeling data

mindbender tagger examples/labeling/**/mindtagger.conf
# See also: ./examples/labeling/start-mindtagger.sh

Take snapshots of your DeepDive app, producing various reports

cd your-deepdive-app
mindbender snapshot
open snapshot/LATEST/README.md

Launch Dashboard to use the reports interactively for deeper error analysis

cd your-deepdive-app
mindbender dashboard

There are some examples included in this source tree:

cd examples/dashboard/spouse_example
mindbender dashboard

Mindtagger

Mindtagger is an interactive data labeling tool. Please refer to the DeepDive documentation for details on how to use Mindtagger to estimate the precision of DeepDive apps. For marking up text documents in general, e.g., for recall estimation, see the example tasks genomics-recall and genomics-recall-relation in the source tree for now. They can be launched using the following script:

./examples/labeling/start-mindtagger.sh

mindbender's People

Contributors

alldefector, netj, oxymor0n, thomaspalomares, wonyeol


mindbender's Issues

MT recall mode: initialize labels from extractions & automatic P/R analysis

It would be very useful if one could dump initial labels from the database and later only need to modify existing labels. Currently we can dump extractions and highlight them as a hint, but it would be nicer to use these extractions to generate initial tags directly.

It would also be nice to compute the diff with the original tags, to understand the precision/recall errors the system is making. For example, all PERSON tags that exist in the manual labels but not in the extractions are recall errors, and all PERSON tags that exist in the extractions but not in the manual labels are precision errors. Ultimately there could be a panel showing the P/R for all relations.

A cleaner design might be to let users specify one or more extraction files, say extractions.csv. Before the first labeling pass, one could click a button to initialize all manual tags from the extractions. These extraction files are dumped after each run of DeepDive, so one can compute the P/R of the current system from an automatic comparison against the labeled tags.json. A P/R panel could then be added to the interface.
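
A minimal sketch of the proposed automatic comparison, assuming the extractions and the manual labels have each been exported as sorted doc_id<TAB>tag lines (extracted.tsv and manual.tsv are hypothetical file names, not files Mindbender produces):

tp=$(comm -12 extracted.tsv manual.tsv | wc -l)   # tags in both: true positives
fp=$(comm -23 extracted.tsv manual.tsv | wc -l)   # extracted but not labeled: precision errors
fn=$(comm -13 extracted.tsv manual.tsv | wc -l)   # labeled but not extracted: recall errors
echo "precision: $(echo "scale=3; $tp / ($tp + $fp)" | bc)"
echo "recall:    $(echo "scale=3; $tp / ($tp + $fn)" | bc)"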

Faceted Search in MBS: Support unified ES doc type so we can have faceted search

Currently MBS creates a separate ES doc type (equivalent to a DB table) for each source or extraction relation. Only one doc type can be searched / rendered at a time. There is no way for the user to perform the typical faceted search because each extraction relation is typically one facet (e.g., age, name, city). Facets do not even propagate across child-parent / FK links.

One possible workaround is to add a post-processing step in DD that joins the source table with all extraction tables (say, using array_agg(extraction_value) grouped by doc_id), and then define all the facets on this unified table; a sketch follows.
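
A rough sketch of that post-processing join, with hypothetical relation names (articles as the source relation, person_age and person_city as extraction relations keyed by doc_id); the actual names come from your app's schema:

psql "$DBNAME" <<'SQL'
-- denormalize source + extractions so all facets live in one table / ES doc type
CREATE TABLE search_unified AS
SELECT a.doc_id,
       a.content,
       array_agg(DISTINCT pa.age)  AS ages,    -- facet: age
       array_agg(DISTINCT pc.city) AS cities   -- facet: city
  FROM articles a
  LEFT JOIN person_age  pa ON pa.doc_id = a.doc_id
  LEFT JOIN person_city pc ON pc.doc_id = a.doc_id
 GROUP BY a.doc_id, a.content;
SQL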

Alternatively, we could add annotation support for the above use case. For example, @reference_inline would let the parent relation absorb all the navigable/searchable fields of the child relation. That would make ES mapping generation a bit less straightforward. For index creation, we could either do the join in SQL and populate ES in one pass or perform multi-pass ES updates (one source/extraction table per pass).

@netj @chrismre

Source Directory Must Be Lowercase

When running Mindbender after running my DeepDive project, I got this error:

lh@dd:~/repos/HardwareDeepDive$ mindbender search update
Launching Elasticsearch for http://localhost:9200 from /home/lh/repos/HardwareDeepDive/search
Deriving mappings for relations table_dump part_num_mentions part_stg_temp_min_label gold_data part_stg_temp_min
Creating index HardwareDeepDive
Elasticsearch error: InvalidIndexNameException[[HardwareDeepDive] Invalid index name [HardwareDeepDive], must be lowercase]

Simply renaming my ~/repos/HardwareDeepDive directory to ~/repos/hardwaredeepdive seems to have solved the problem.
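
In other words, since the index name is derived from the app directory name and Elasticsearch only accepts lowercase index names, renaming the directory is a sufficient workaround:

mv ~/repos/HardwareDeepDive ~/repos/hardwaredeepdive
cd ~/repos/hardwaredeepdive
mindbender search update    # the derived index name is now lowercase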

invalid mindbender COMMAND

I am using the latest version of DeepDive. I tried the following commands from the README.md, but got these errors:
$ mindbender search update

mindbender shell
Usage: mindbender [-OPTION] COMMAND [ARG]...
Global OPTION is one of:
  -v  increase verbosity
  -q  suppress all messages
  -t  force logging to non-ttys
      (default is to log messages to stderr only when it's a tty)
search: invalid COMMAND

$ mindbender search gui

(same usage output and "search: invalid COMMAND" error as above)

Am I doing something wrong?

How to change the ip:port config to run mindbender on a server?

I installed DeepDive on an Ubuntu server without a GUI. The command "mindbender search gui" runs at localhost:8000, and Elasticsearch also runs on localhost. How do I change the config files so the services can be reached at an ip:port like 194.1.168.3:8000?
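
One workaround that does not require changing any Mindbender or Elasticsearch config (this is a plain SSH tunnel, not a documented Mindbender option) is to forward the GUI port from the server to your local machine:

# run on your local machine; 194.1.168.3 is the server from the question above
ssh -L 8000:localhost:8000 your-user@194.1.168.3
# then open http://localhost:8000 locally while mindbender runs on the server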

Exported Insert.sql column type error

For the generated insert.sql file, the column type for labels should be BOOLEAN, but it is generated as TEXT. This causes an error on Greenplum.

$ psql $DBNAME < insert.sql
NOTICE:  table "tags_precision_m_claim_type_v1" does not exist, skipping
DROP TABLE
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'doc_id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
ERROR:  column "is_correct" is of type text but expression is of type boolean
LINE 2: ("doc_id", "mention_id", "is_correct", ...
                                 ^
HINT:  You will need to rewrite or cast the expression.
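
One possible workaround until the export is fixed, assuming the generated DDL declares the column roughly as "is_correct" text (the exact spelling may differ in your file), is to patch the type before loading:

# rewrite the label column type in the generated file, then load it into Greenplum
sed -i 's/"is_correct" text/"is_correct" boolean/' insert.sql
psql "$DBNAME" < insert.sql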

Automatic tags.json backup

We should be able to regularly back up tags.json. This file is so important that we should back it up every hour, every 10 modifications, or so.
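
Until that is built in, a crude external workaround is a loop (or cron job) that keeps timestamped copies on a schedule; the path below is hypothetical:

# keep an hourly, timestamped copy of tags.json next to the labeling task
while true; do
  cp labeling/my-task/tags.json "labeling/my-task/tags.json.$(date +%Y%m%d%H%M)"
  sleep 3600
done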

tags.json permission problems

We had a use case where user A started Mindtagger for labeling on a server S, user C visited the labeling interface, and tags.json was generated.

At time T, user B logged into the same server S and started Mindtagger in the same folder. However, user B did not have permission to change the tags.json file, so when C continued labeling, the changes could not be saved to tags.json; worse, C could not tell from the frontend that anything was wrong. The tags.json file stayed as it was at time T, since no further changes could be saved.

Possible solutions might be to force tags.json to be writable by everyone, or to check for permission problems and report them (to the frontend?).
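
A server-side stopgap, assuming all labelers share a Unix group (labelers is a hypothetical group name), is to make the file group-writable before handing the task over:

chgrp labelers tags.json   # shared group for everyone who runs mindtagger in this folder
chmod g+w tags.json        # let other group members save label changes
ls -l tags.json            # confirm the permissions before restarting mindtagger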

Small UI issue

With the current version of Mindtagger, the line height changes after a label is created, and if the mouse hovers over the mention, the line height changes quite a bit:

[screenshot omitted]

Though it's a minor issue, I think this somehow affects the user experience. =)

is_correct counts disappear after restarting mindtagger

I suppose Mindtagger reloads the annotations from tags.json every time it restarts. The is_correct annotations are still there, but the distribution of is_correct values is missing. Please see the attached screenshot. -Xiao

[screenshot omitted]

MT Shortkey

An idea for shortcut keys: one could allow multiple labels to be associated with one shortcut key, switching between them when the key is pressed multiple times.

e.g., address and amount both have the shortcut key a, and
a pressed once: address is selected.
a pressed twice: amount is selected.

MT Recall Mode: Label for relations

MT recall mode is useful for labeling mentions, but painful for labeling relations. As I mentioned in a separate email, to label all relations in a document I currently need to:

  1. hack the display of the mention_id in the bottom tags navbar
  2. make a selection of mention 1, then copy the mention_id into a TSV file
  3. make a selection of mention 2, then copy the mention_id into the same line of the TSV file

It would be great to support recall labeling natively in the interface. It would be nice simply to be able to:

  • select the two mentions in the interface, then click a button, and the two mentions are labeled as a relation.

or alternatively:

  • Select mention 1, click "add to relation"
  • select mention 2, click "add to relation"
  • then click "finish labeling a relation" and they are recorded.

With the latter approach, one could even label a relation across different pages.

As an idea for visualizing relations, mentions in a relation could be marked up with the same number in the top-right corner.

@netj @chrismre

Adapt Mindtagger Instance for Genepheno Precision alone

Copy labeling/templates/genepheno-holdout or adapt labeling/templates/genepheno-precision in order to label only sentences with expectation > 0.9, for a better precision estimate. You should modify at least input.sql, or (to add features) also template.html; a sketch of the input.sql change follows.
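
A minimal sketch of such an input.sql change, assuming the inference results live in a relation like genepheno_causation_is_correct_inference with DeepDive's usual expectation column (the actual relation name depends on the app):

# restrict the Mindtagger input to high-confidence candidates only
cat > input.sql <<'SQL'
SELECT *
  FROM genepheno_causation_is_correct_inference   -- hypothetical; use your app's inference relation
 WHERE expectation > 0.9
SQL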

How to trace source sentence with multiple keys?

I want to trace the source sentence for my extraction based on multiple keys, so I wrote something like:

@source
sentences(
    @key
    doc_id text,
    @key
    section_id text,
    ref_doc_id text,
    @key
    sent_id int,
    @searchable
    words text[],
    lemmas text[],
    poses text[],
    ners text[],
    dep_paths text[],
    dep_parents int[]).

@extraction
gene_mentions(
    id bigint,
    @references(relation="sentences", column="doc_id", alias="sent_gene")
    doc_id text,
    @references(relation="sentences", column="section_id", alias="sent_gene")
    section_id text,
    @references(relation="sentences", column="sent_id", alias="sent_gene")
    sent_id int,
    wordidxs int[],
    @key
    mention_id text,
    supertype text,
    subtype text,
    @searchable
    entity text,
    @searchable
    words text[],
    @navigable
    is_correct boolean).

But it doesn't work. Am I missing something? How can I use multiple keys to find the source sentence?

[Dashboard] Formatted reports include html

At least with many of the sample reports in the stanford-memex data, the API call to GET api/snapshot/snapshotId/reportId includes an "html" key for formatted reports. This makes it difficult to distinguish formatted reports from custom reports, as this determination is based on whether the "html" key is present in the report JSON.

Example:
/api/snapshot/20150410-6/variable/candidate/sample-frequent-candidates%20rates.is_correct

The resulting JSON includes a "chart" key under the first data key, suggesting that it is a formatted report, but also includes an "html" key.

Throw meaningful errors when port is taken

Currently, when the port is taken, it shows an error like:

Starting Mindtagger for all tasks under /lfs/local/0/senwu/labeling/labeling2/...
Parsing task configuration: location_mentions_precision/mindtagger.conf

events.js:72
        throw er; // Unhandled 'error' event
              ^
Error: listen EACCES
  at errnoException (net.js:904:11)
  at Server._listen2 (net.js:1023:19)
  at listen (net.js:1064:10)
  at Server.listen (net.js:1138:5)
  at Object.<anonymous> (/tmp/mindbender-senwu/9a0f9034fadb8b42437951c90241a43febf3af65/gui/server.coffee:51:8)
  at Object.<anonymous> (/tmp/mindbender-senwu/9a0f9034fadb8b42437951c90241a43febf3af65/gui/server.coffee:2:1)
  at Module._compile (module.js:456:26)

The error message could be more intuitive.
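
Until the error is made friendlier, a quick manual check for whether another process is already bound to the GUI port (assuming the default port 8000; adjust to your setup):

lsof -i :8000                  # show the process currently listening on the port
# or, if lsof is unavailable:
netstat -tlnp | grep ':8000'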

Master branch doesn't build, but there's an easy fix

When you build master and launch the application, you get a JavaScript error in the browser. The problem is that some CoffeeScript gets compiled into incorrect JavaScript:

angular.module('mindbender.dashboard', ['ui.ace', 'ui.bootstrap', 'ui.sortable'].service('Dashboard', ...

Note that a closing parenthesis is missing after 'ui.sortable']. This is because the CoffeeScript compiler used is version 1.6.3, which doesn't correctly handle this AngularJS code.

The correct version, 1.8.0, is specified in gui/frontend/bower.json, but it is never used. Instead, an npm package (karma) forces installation of CoffeeScript 1.6.3 into gui/frontend/node_modules, which is then used in the build.

There's an easy fix: add "coffee-script": "1.8.0" to the devDependencies section of gui/frontend/package.json.

(This also fixes a second error: the older CoffeeScript didn't support the power operator, e.g., 3 ** 3.)
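
One way to apply the suggested fix from the command line, which adds the pinned version to devDependencies in gui/frontend/package.json:

cd gui/frontend
npm install --save-dev coffee-script@1.8.0   # pin the CoffeeScript version used by the build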

mindtagger search/filter function

I'm using the latest version of Mindbender. I wonder if Mindtagger has a keyword search that would let me quickly locate a prediction instance I labeled. I'd also like to filter prediction instances by their tag values (i.e., see only the examples labeled with a particular tag value). I'm new to the tool, so if these functions already exist, please point them out. I tried the search box at the top of the page and it didn't work; neither did clicking the tag values. -Xiao

MT Recall mode: Pagination / Render by document

In Mindtagger recall mode, rendering by document is not natively supported. All the sentences are rendered in a flat manner, with no clear boundary between documents. One can hack around this by manipulating the input data and template, but it would be nice to support it natively.

@netj @chrismre
