hazyresearch / mindbender
Tools for iterative knowledge base development with DeepDive
Currently MBS creates a separate ES doc type (equivalent to a DB table) for each source or extraction relation. Only one doc type can be searched / rendered at a time. There is no way for the user to perform the typical faceted search because each extraction relation is typically one facet (e.g., age, name, city). Facets do not even propagate across child-parent / FK links.
One possible workaround is to have a post-processing step in DD that joins the source table with all extraction tables (say, using array_agg(extraction_value) by doc_id), and then define all the facets on this unified table.
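The join-and-aggregate step described above can be illustrated outside the database. This is only a minimal Python sketch over in-memory rows: the relation names (age, city), the field names, and the helper unify_facets are hypothetical and not part of Mindbender or DeepDive.

```python
from collections import defaultdict


def unify_facets(source_rows, extractions):
    """Fold every extraction relation into its source document.

    source_rows: list of dicts, each with a "doc_id" key.
    extractions: dict mapping relation name -> list of (doc_id, value).
    Returns one dict per source row with one list-valued field per
    relation, the in-memory analogue of array_agg grouped by doc_id.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for relation, rows in extractions.items():
        for doc_id, value in rows:
            grouped[doc_id][relation].append(value)

    unified = []
    for row in source_rows:
        doc = dict(row)
        for relation in extractions:
            # Documents without extractions get an empty list,
            # as a LEFT JOIN from the source table would produce.
            doc[relation] = list(grouped[row["doc_id"]][relation])
        unified.append(doc)
    return unified


docs = unify_facets(
    [{"doc_id": "d1", "content": "..."}, {"doc_id": "d2", "content": "..."}],
    {"age": [("d1", "25")], "city": [("d1", "Palo Alto"), ("d1", "Stanford")]},
)
```

Each unified document then carries every facet as a field of its own, so a single doc type could be searched with all facets at once.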
Alternatively, we could add annotation support for the above use case. For example, @reference_inline would let the parent relation absorb all the navigable/searchable fields of the child relation. That would make ES mapping generation a bit less straightforward. For index creation, we could either do the join in SQL and populate ES in one pass or perform multi-pass ES updates (one source/extraction table per pass).
I tried @searchable on a column of type text and it works fine.
But when I try @searchable on a column of type text[], it doesn't work and displays some other information instead, e.g., NER, POS, LEMMA, etc.
Do we support @searchable for text[]?
Mindtagger by default uses the os.hostname variable to decide the frontend host, but there might be cases where users want to specify the host.
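One way such an override could look is an environment variable that takes precedence over the detected hostname. The sketch below is in Python rather than Mindtagger's own code, and MINDTAGGER_HOST is a hypothetical variable name chosen for illustration only.

```python
import os
import socket


def frontend_host(env=None):
    """Return the host to advertise, preferring an explicit override."""
    env = os.environ if env is None else env
    # MINDTAGGER_HOST is a hypothetical name, not an existing option.
    return env.get("MINDTAGGER_HOST") or socket.gethostname()
```

With the variable unset or empty, the behavior falls back to the detected hostname, so existing setups would be unaffected.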
We could backport this change to parallelize indexing:
bc869e8
It simply uses parallel instead of split.
These improvements for a backport would be great:
parallel
Copy labeling/templates/genepheno-holdout or adapt labeling/templates/genepheno-precision in order to label only sentences with expectation > 0.9, to better estimate precision. You should modify at least input.sql, or (to add features) also template.html.
I'm using the latest version of mindbender. I wonder if mindtagger has a keyword search that allows me to quickly locate a prediction instance I labeled. I'd also like to filter prediction instances by their tag values (i.e. I only want to see the examples labeled with a particular tag value). I'm new to the tool. If these functions already exist, please point them out. I tried the search box at the top of the page and it didn't work. Clicking the tag values didn't work either. -Xiao
The JSON file for the calibration plot report contains empty strings. These empty strings should probably be zeros.
The README.md says we can run the spouse example with the following commands:
cd examples/spouse_example
deepdive initdb
deepdive run
Btw: the path is wrong. Can you clarify it?
I installed DeepDive on an Ubuntu server without a GUI. The command "mindbender search gui" serves at localhost:8000, and Elasticsearch may also run at localhost... How do I change the config files so that it can be reached at an ip:port like 194.1.168.3:8000?
Cannot parse CSV
[SyntaxError: Unexpected end of input]
Exporting tags.json of Mindtagger tasks is currently only possible via the GUI backend. This code can be put into its own command for more automation, and the GUI can simply call the command as we do for most other backend APIs.
To provide UI elements for specifying parameters in the snapshot-config editor.
Now when the port is taken, it shows an error like:
Starting Mindtagger for all tasks under /lfs/local/0/senwu/labeling/labeling2/...
Parsing task configuration: location_mentions_precision/mindtagger.conf
events.js:72
throw er; // Unhandled 'error' event
^
Error: listen EACCES
at errnoException (net.js:904:11)
at Server._listen2 (net.js:1023:19)
at listen (net.js:1064:10)
at Server.listen (net.js:1138:5)
at Object.<anonymous> (/tmp/mindbender-senwu/9a0f9034fadb8b42437951c90241a43febf3af65/gui/server.coffee:51:8)
at Object.<anonymous> (/tmp/mindbender-senwu/9a0f9034fadb8b42437951c90241a43febf3af65/gui/server.coffee:2:1)
at Module._compile (module.js:456:26)
The error message could be more intuitive.
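As an illustration of what a more intuitive message could look like, here is a minimal Python sketch that maps low-level listen errors to hints. It is not Mindtagger's server code (which is CoffeeScript on Node.js); the function names are hypothetical. The trace above shows EACCES (permission denied), so both that code and the "address in use" code are handled.

```python
import errno
import socket


def explain_listen_error(err, port):
    """Translate an OSError raised while binding a port into a hint."""
    if err.errno == errno.EADDRINUSE:
        return (f"Port {port} is already in use; "
                "stop the other process or choose another port.")
    if err.errno == errno.EACCES:
        return (f"Permission denied for port {port}; on many Unix systems "
                "ports below 1024 require elevated privileges.")
    return f"Could not listen on port {port}: {err.strerror}"


def try_listen(port, host="127.0.0.1"):
    """Return (socket, None) on success or (None, message) on failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock, None
    except OSError as err:
        sock.close()
        return None, explain_listen_error(err, port)
```

The caller can print the returned message and exit instead of surfacing the raw stack trace.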
I finally really got the difference between "formatted" and "custom". Maybe follow the renaming suggestion? It would have helped me.
BTW obviously: Feel free to delete all of these issues if you think the suggestions are stupid ;) .
[error] app.ddlog[67.26] failure: `(' expected but `l' found
function ext_people over like ext_people_input
^
The example is different from the example in the ddlog repository. Did the syntax change?
In MindTagger recall mode, rendering by document is not natively supported. All the sentences are rendered in a flat manner, and there is no clear boundary between documents. One can hack around this by manipulating the input data and template, but it would be nice to support this natively.
Creating a new child template (e.g. "variable/test") empties the parent template (e.g. "variable"). It does not empty sibling templates (e.g. "variable/feature").
We should be able to regularly back up tags.json. This file is so important that we should back it up every hour, every 10 modifications, or so.
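The "every 10 modifications" policy could be sketched as below. This is a hypothetical Python helper, not existing Mindtagger behavior; the class name, the backup directory, and the file naming scheme are assumptions made for illustration.

```python
import shutil
import time
from pathlib import Path


class TagsBackup:
    """Copy a tags file aside after every `every` recorded modifications."""

    def __init__(self, tags_path, backup_dir, every=10):
        self.tags_path = Path(tags_path)
        self.backup_dir = Path(backup_dir)
        self.every = every
        self.pending = 0

    def record_modification(self):
        """Call after each save; returns the backup path if one was made."""
        self.pending += 1
        if self.pending < self.every:
            return None
        self.pending = 0
        self.backup_dir.mkdir(parents=True, exist_ok=True)
        # Second-resolution timestamp; two backups within the same
        # second would share a name and the later one would win.
        stamp = time.strftime("%Y%m%d-%H%M%S")
        target = self.backup_dir / f"{self.tags_path.stem}-{stamp}.json"
        shutil.copy2(self.tags_path, target)
        return target
```

An hourly backup could reuse the same copy step from a timer instead of a modification counter.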
I want to trace the source sentence for my extraction based on multiple keys, and I wrote something like:
@source
sentences(
    @key
    doc_id text,
    @key
    section_id text,
    ref_doc_id text,
    @key
    sent_id int,
    @searchable
    words text[],
    lemmas text[],
    poses text[],
    ners text[],
    dep_paths text[],
    dep_parents int[]).
@Extraction
gene_mentions(
    id bigint,
    @references(relation="sentences", column="doc_id", alias="sent_gene")
    doc_id text,
    @references(relation="sentences", column="section_id", alias="sent_gene")
    section_id text,
    @references(relation="sentences", column="sent_id", alias="sent_gene")
    sent_id int,
    wordidxs int[],
    @key
    mention_id text,
    supertype text,
    subtype text,
    @searchable
    entity text,
    @searchable
    words text[],
    @navigable
    is_correct boolean).
But it doesn't work. Am I missing something? How can I use multiple keys to find the source sentence?
When running Mindbender after running my DeepDive project, I got this error:
lh@dd:~/repos/HardwareDeepDive$ mindbender search update
Launching Elasticsearch for http://localhost:9200 from /home/lh/repos/HardwareDeepDive/search
Deriving mappings for relations table_dump part_num_mentions part_stg_temp_min_label gold_data part_stg_temp_min
Creating index HardwareDeepDive
Elasticsearch error: InvalidIndexNameException[[HardwareDeepDive] Invalid index name [HardwareDeepDive], must be lowercase]
Simply renaming my ~/repos/HardwareDeepDive directory to ~/repos/hardwaredeepdive seems to have solved the problem.
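Instead of requiring a rename, the index name could be normalized when it is derived from the directory. The following is a minimal Python sketch under that assumption; index_name_for is a hypothetical helper, and only the lowercase rule is confirmed by the error above. The extra character handling reflects Elasticsearch's index naming restrictions as I understand them.

```python
import os
import re


def index_name_for(app_dir):
    """Derive a lowercase index name from an application directory."""
    name = os.path.basename(os.path.normpath(app_dir)).lower()
    # Replace characters that Elasticsearch rejects in index names.
    name = re.sub(r'[\\/*?"<>| ,#]', "_", name)
    # Index names may not start with "-", "_" or "+".
    return name.lstrip("-_+")
```

For the directory in this report, the sketch yields hardwaredeepdive, matching the manual rename.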
For the generated insert.sql file, the column types for labels should be BOOLEAN but are generated as TEXT. This causes an error on Greenplum.
$ psql $DBNAME < insert.sql
NOTICE: table "tags_precision_m_claim_type_v1" does not exist, skipping
DROP TABLE
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'doc_id' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
ERROR: column "is_correct" is of type text but expression is of type boolean
LINE 2: ("doc_id", "mention_id", "is_correct", ...
^
HINT: You will need to rewrite or cast the expression.
At least with many of the sample reports in the stanford-memex data, the API call to GET api/snapshot/snapshotId/reportId includes an "html" key for formatted reports. This makes it difficult to distinguish formatted from custom reports, as this determination is based on whether the "html" key is present in the report JSON.
Example:
/api/snapshot/20150410-6/variable/candidate/sample-frequent-candidates%20rates.is_correct
The resulting JSON includes a "chart" key under the first data key, suggesting that it is a formatted report, but also includes an "html" key.
by showing dropdowns and input entries on demand
I am using the latest version of DeepDive. I tried the following commands from the README.md, but got the following errors:
$ mindbender search update
search: invalid COMMAND
$ mindbender search gui
search: invalid COMMAND
Did I do something wrong?
MT recall mode is useful for labeling mentions, but painful for labeling relations. As I mentioned in a separate email, currently to label all relations in a document I need to:
It would be great to natively support labeling recall in the interface. It would simply be nice to be able to:
or alternatively:
In the latter way one would be able to even label a relation in different pages.
As an idea for visualizing relations, mentions in the same relation could be marked with the same number in the top-right corner.
We had a use case where user A on a server S started mindtagger for labeling, and user C visited the labeling interface, so tags.json was generated.
At time T, user B logged into the same server S and started mindtagger in the same folder. However, user B does not have permission to change the tags.json file, so when C continued labeling, the changes could not be saved to tags.json, and from the frontend C could not tell that there was a problem. The tags.json file stayed as it was at time T, since no additional changes could be saved.
Possible solutions might be to force tags.json to be writable by everyone, or to check for permission problems and report them (to the frontend?).
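The second option, checking for permission problems before saving, might look roughly like this. It is a minimal Python sketch, not Mindtagger's actual backend code; check_tags_writable is a hypothetical helper whose returned message would be forwarded to the frontend.

```python
import os


def check_tags_writable(tags_path):
    """Return None if tags can be saved, else a message for the frontend."""
    if os.path.exists(tags_path):
        if not os.access(tags_path, os.W_OK):
            return (f"{tags_path} is not writable by the current user; "
                    "labels will not be saved.")
        return None
    # The file does not exist yet, so the directory must allow creation.
    parent = os.path.dirname(os.path.abspath(tags_path))
    if not os.access(parent, os.W_OK):
        return f"Cannot create {tags_path}: directory {parent} is not writable."
    return None
```

Running the check at startup and again on each save would surface the problem at time T rather than leaving the labeler unaware.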
We should preserve the order.
It might be useful in some applications to label cross-sentence relations. Right now, relation labeling only supports single-sentence ones.
This seems to become more common as I add more reports to the snapshots. I routinely need to refresh up to 20 times before the data suddenly appears. The left pane is also completely empty.
In recall mode, the bottom bar sometimes covers some text if there are too many tags in the sentence. Is there an easy CSS fix for this?
When you build master and launch the application, you'll get a javascript error in the browser. The problem is that some coffeescript gets translated into incorrect javascript:
angular.module('mindbender.dashboard', ['ui.ace', 'ui.bootstrap', 'ui.sortable'].service('Dashboard', ...
Note that there's a closing parenthesis missing after 'ui.sortable']. This is because the coffeescript compiler used is version 1.6.3, which doesn't correctly handle angularjs code.
The correct version 1.8.0 is specified in gui/frontend/bower.json, but it never gets called. Instead, an npm package (karma) forces installation of coffeescript 1.6.3 into gui/frontend/node_modules, which is then used in the build.
There's an easy fix: Add "coffee-script": "1.8.0" to gui/frontend/package.json in section devDependencies.
(There's also a second error that this fixes: the older coffeescript didn't support the power operator, e.g. 3 ** 3)
In the genepheno configuration:
Add template "variable". The fields "features_column" and "features_layout" are missing, but are present in the existing "variable" template configuration.
Recall mode works well with v0.2.0, but LATEST seems to break the normal functionality of recall mode: one cannot navigate through the words in sentences.
If we could easily label all occurrences of the same phrase, that would be very useful for productivity.
Learned from @netj that MB Search works only with PostgreSQL 9.3 (and not PGXL or Greenplum) because it uses the to_json function.
How hard is it to drop the requirement for to_json? Or at least ship a basic version of it. E.g., the following could be used to patch PGXL:
http://www.pgxn.org/dist/json_enhancements/doc/json_enhancements.html
https://bitbucket.org/IVC-Inc/json_enhancements/overview
An idea for shortcut keys: one could allow different labels to be associated with one shortcut key, and switch between the labels when this key is pressed multiple times.
E.g., address and amount both have shortcut key a, and
a pressed once: address selected.
a pressed twice: amount selected.
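The cycling behavior can be sketched in a few lines. This is a hypothetical Python illustration, not Mindtagger code; wrapping back to the first label after the last one is an assumption that goes beyond what is described above.

```python
def label_for_presses(labels_by_key, key, presses):
    """Return the label selected after pressing `key` `presses` times."""
    labels = labels_by_key.get(key, [])
    if not labels or presses < 1:
        return None
    # Wrap around so further presses cycle back to the first label.
    return labels[(presses - 1) % len(labels)]


shortcuts = {"a": ["address", "amount"]}
```

Here label_for_presses(shortcuts, "a", 1) selects address and a second press selects amount, as in the example.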
It's a little hard to explain, but I'll try here:
If one could dump initial labels from the database, and later only need to modify existing labels, that would be very useful. Now we can dump extractions and highlight them as a hint, but it would be nicer to just use these extractions to generate initial tags.
It would also be nice to be able to compute the diff with the original tags, to understand the precision / recall errors the system is making. E.g., all tags in PERSON that exist in the manual labels but not in the extractions are recall errors; all tags in PERSON that exist in the extractions but not in the manual labels are precision errors. Ultimately there could be a panel showing the P/R for all relations.
A cleaner design might be: enable users to specify one or multiple extraction file(s), say extractions.csv. Before the first labeling, one can click a button to initialize all manual tags from the extractions. These extraction files are dumped after each run of DeepDive. With these files one can compute the P/R of the current system, based on automatic comparison with the labeled tags.json. The P/R panel could be added to the interface.
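The comparison described above amounts to set differences between extracted and manually labeled tags. Below is a minimal Python sketch, assuming each tag can be represented as a hashable tuple such as (doc_id, start, end, label); that representation and the function name are hypothetical, and the actual formats of tags.json and extractions.csv are not modeled here.

```python
def precision_recall(extracted, manual):
    """Compare extracted tags against manual tags for one relation."""
    extracted, manual = set(extracted), set(manual)
    correct = extracted & manual
    return {
        # In the extractions but not in the manual labels.
        "precision_errors": extracted - manual,
        # In the manual labels but not in the extractions.
        "recall_errors": manual - extracted,
        "precision": len(correct) / len(extracted) if extracted else None,
        "recall": len(correct) / len(manual) if manual else None,
    }


report = precision_recall(
    extracted={("d1", 0, 2, "PERSON"), ("d1", 5, 6, "PERSON")},
    manual={("d1", 0, 2, "PERSON"), ("d2", 1, 3, "PERSON")},
)
```

Running this per relation would give the numbers for the proposed P/R panel, with the two error sets available for inspection.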