DeepDive
Home Page: deepdive.stanford.edu
I found that holdout_query must end with ';' for now. If I use the following:
holdout_query: "INSERT INTO dd_graph_variables_holdout(variable_id) select id from candidate where docid in (select docid from eval_docs)"
the SQL execution fails with:
21:59:06 [] ERROR SQL execution failed (Reason: ERROR: syntax error at or near "UPDATE"
Position: 122):
DROP TABLE IF EXISTS candidate_label_cardinality CASCADE;CREATE TABLE candidate_label_cardinality(candidate_label_cardinality) AS VALUES (1) WITH DATA;INSERT INTO dd_graph_variables_map(variable_id) SELECT id FROM dd_graph_variables;INSERT INTO dd_graph_variables_holdout(variable_id) select id from candidate where docid in (select docid from eval_docs)UPDATE dd_graph_variables SET is_evidence=false WHERE dd_graph_variables.id IN (SELECT variable_id FROM dd_graph_variables_holdout)
21:59:07 [inferenceManager] ERROR ERROR: syntax error at or near "UPDATE"
Position: 122
Apparently the system fails to append a ";" after the holdout_query. Should be easy to fix.
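Until that is fixed, the workaround is to terminate the query yourself in application.conf, i.e.:
holdout_query: "INSERT INTO dd_graph_variables_holdout(variable_id) select id from candidate where docid in (select docid from eval_docs);"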
The current tests assume that DeepDive will auto-assign an ID for developers when "id" is not returned in the JSON. We removed that behavior in the latest commit 5a6d651, so the tests need to be fixed to drop this assumption: https://travis-ci.org/HazyResearch/deepdive/jobs/24320060
See def buildCopySql in PostgresExtractionDataStore.scala. We must fix this before the next code push.
@zhangce @feiranwang @msushkov
change
deepdive.extractions: {
wordsExtractor.style: "udf_extractor"
wordsExtractor.output_relation: "words"
wordsExtractor.input: "SELECT * FROM titles"
wordsExtractor.udf: "words.py"
}
to
deepdive.extraction.extractors: {
wordsExtractor {
style: "udf_extractor"
output_relation: "words"
input: "SELECT * FROM titles"
udf: "words.py"
}
}
Rationale: (1) extractions becomes extraction.extractors; (2) the nested form is easier to read. This needs to be updated in all documentation on the web.
There should be error messages when encountering unexpected configuration items.
Just now I mistyped "dependencies" as "depencencies", and there was no error message when parsing the config file. As a result the dependencies were broken, but the programmer had no way to know what caused the problem.
I strongly suggest that unexpected config items be rejected, or at least produce a warning.
While the system is running I can see the log in log/2014-XX..XX.txt, but after I send a SIGINT (Ctrl+C), the log becomes empty. Frustrating...
For such a case, let's change the learning rate from the default of "0.1" to "0.001" by adding the following sampler options to the configuration file:
sampler.sampler_args: "-l 125 -s 1 -i 200 --alpha 0.001"
Error in develop branch but not in master.
14:05:38.050 [default-dispatcher-2][profiler][Profiler] DEBUG starting report_id=inference_grounding
14:05:38.051 [default-dispatcher-3][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore(akka://deepdive)][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore] INFO Writing grounding queries to file="/var/folders/rz/0l6t9_w90hs_k6l6fq7nlsxm0000gn/T/grounding8297874664321351755.sql"
14:05:38.052 [default-dispatcher-6][taskManager][TaskManager] INFO Added task_id=inference
14:05:38.053 [default-dispatcher-6][taskManager][TaskManager] INFO 0/1 tasks eligible.
14:05:38.053 [default-dispatcher-6][taskManager][TaskManager] INFO Tasks not_eligible: Set(inference)
14:05:38.054 [default-dispatcher-6][taskManager][TaskManager] INFO Added task_id=calibration
14:05:38.054 [default-dispatcher-6][taskManager][TaskManager] INFO 0/2 tasks eligible.
14:05:38.055 [default-dispatcher-6][taskManager][TaskManager] INFO Tasks not_eligible: Set(inference, calibration)
14:05:38.056 [default-dispatcher-6][taskManager][TaskManager] INFO Added task_id=report
14:05:38.057 [default-dispatcher-6][taskManager][TaskManager] INFO 0/3 tasks eligible.
14:05:38.058 [default-dispatcher-6][taskManager][TaskManager] INFO Tasks not_eligible: Set(inference, report, calibration)
14:05:38.058 [default-dispatcher-6][taskManager][TaskManager] INFO Added task_id=shutdown
14:05:38.059 [default-dispatcher-6][taskManager][TaskManager] INFO 0/4 tasks eligible.
14:05:38.059 [default-dispatcher-6][taskManager][TaskManager] INFO Tasks not_eligible: Set(shutdown, inference, report, calibration)
14:05:38.076 [default-dispatcher-3][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore(akka://deepdive)][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore] INFO Executing grounding query...
14:05:38.351 [][][StatementExecutor$$anon$1] ERROR SQL execution failed (Reason: ERROR: invalid input syntax for integer: ""
Position: 184):
INSERT INTO dd_graph_weights(initial_value, is_fixed, description) SELECT DISTINCT 0.0 AS wValue, false AS wIsFixed, 'label1-' || (CASE WHEN "features.feature_id" IS NULL THEN '' ELSE "features.feature_id" END) || "label1_val_cardinality" AS wCmd FROM label1_query GROUP BY wValue, wIsFixed, wCmd
14:05:38.370 [default-dispatcher-3][inferenceManager][OneForOneStrategy] ERROR ERROR: invalid input syntax for integer: ""
Position: 184
org.postgresql.util.PSQLException: ERROR: invalid input syntax for integer: ""
Position: 184
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2157) ~[postgresql-9.2-1003-jdbc4.jar:na]
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1886) ~[postgresql-9.2-1003-jdbc4.jar:na]
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255) ~[postgresql-9.2-1003-jdbc4.jar:na]
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:555) ~[postgresql-9.2-1003-jdbc4.jar:na]
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:417) ~[postgresql-9.2-1003-jdbc4.jar:na]
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:410) ~[postgresql-9.2-1003-jdbc4.jar:na]
at org.apache.commons.dbcp.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:172) ~[commons-dbcp-1.4.jar:1.4]
at org.apache.commons.dbcp.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:172) ~[commons-dbcp-1.4.jar:1.4]
at scalikejdbc.StatementExecutor$$anonfun$execute$1.apply$mcZ$sp(StatementExecutor.scala:295) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.StatementExecutor$$anonfun$execute$1.apply(StatementExecutor.scala:295) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.StatementExecutor$$anonfun$execute$1.apply(StatementExecutor.scala:295) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.StatementExecutor$NakedExecutor.apply(StatementExecutor.scala:33) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.StatementExecutor$$anon$1.scalikejdbc$StatementExecutor$LoggingSQLAndTiming$$super$apply(StatementExecutor.scala:291) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.StatementExecutor$LoggingSQLAndTiming$class.apply(StatementExecutor.scala:238) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.StatementExecutor$$anon$1.scalikejdbc$StatementExecutor$LoggingSQLIfFailed$$super$apply(StatementExecutor.scala:291) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.StatementExecutor$LoggingSQLIfFailed$class.apply(StatementExecutor.scala:269) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.StatementExecutor$$anon$1.apply(StatementExecutor.scala:291) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.StatementExecutor.execute(StatementExecutor.scala:295) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.DBSession$$anonfun$executeWithFilters$1.apply(DBSession.scala:248) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.DBSession$$anonfun$executeWithFilters$1.apply(DBSession.scala:246) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.LoanPattern$.using(LoanPattern.scala:29) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.package$.using(package.scala:76) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.DBSession$class.executeWithFilters(DBSession.scala:245) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.ActiveSession.executeWithFilters(DBSession.scala:420) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.SQLExecution.apply(SQL.scala:441) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at org.deepdive.inference.SQLInferenceDataStore$$anonfun$4$$anonfun$apply$4.apply(SQLInferenceDataStore.scala:39) ~[classes/:na]
at org.deepdive.inference.SQLInferenceDataStore$$anonfun$4$$anonfun$apply$4.apply(SQLInferenceDataStore.scala:38) ~[classes/:na]
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) ~[scala-library.jar:0.13.1]
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) ~[scala-library.jar:0.13.1]
at org.deepdive.inference.SQLInferenceDataStore$$anonfun$4.apply(SQLInferenceDataStore.scala:38) ~[classes/:na]
at org.deepdive.inference.SQLInferenceDataStore$$anonfun$4.apply(SQLInferenceDataStore.scala:37) ~[classes/:na]
at scalikejdbc.DBConnection$$anonfun$autoCommit$1.apply(DB.scala:185) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.DBConnection$$anonfun$autoCommit$1.apply(DB.scala:184) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.LoanPattern$.using(LoanPattern.scala:29) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.package$.using(package.scala:76) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.DBConnection$class.autoCommit(DB.scala:184) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.DB.autoCommit(DB.scala:498) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.DB$$anonfun$autoCommit$2.apply(DB.scala:641) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.DB$$anonfun$autoCommit$2.apply(DB.scala:640) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.LoanPattern$.using(LoanPattern.scala:29) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.package$.using(package.scala:76) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at scalikejdbc.DB$.autoCommit(DB.scala:640) ~[scalikejdbc_2.10-1.7.4.jar:1.7.4]
at org.deepdive.inference.SQLInferenceDataStore$class.execute(SQLInferenceDataStore.scala:37) ~[classes/:na]
at org.deepdive.inference.PostgresInferenceDataStoreComponent$PostgresInferenceDataStore.execute(PostgresInferenceDataStore.scala:19) ~[classes/:na]
at org.deepdive.inference.SQLInferenceDataStore$class.groundFactorGraph(SQLInferenceDataStore.scala:536) ~[classes/:na]
at org.deepdive.inference.PostgresInferenceDataStoreComponent$PostgresInferenceDataStore.groundFactorGraph(PostgresInferenceDataStore.scala:19) ~[classes/:na]
at org.deepdive.inference.InferenceManager$$anonfun$receive$1.applyOrElse(InferenceManager.scala:59) ~[classes/:na]
at akka.actor.Actor$class.aroundReceive(Actor.scala:467) ~[akka-actor_2.10-2.3-M2.jar:2.3-M2]
at org.deepdive.inference.InferenceManager$PostgresInferenceManager.aroundReceive(InferenceManager.scala:116) ~[classes/:na]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:491) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
at akka.actor.ActorCell.invoke(ActorCell.scala:462) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
at akka.dispatch.Mailbox.run(Mailbox.scala:219) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:385) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library.jar:na]
14:05:38.372 [default-dispatcher-6][inferenceManager][InferenceManager$PostgresInferenceManager] INFO Starting
14:05:38.372 [default-dispatcher-3][factorGraphBuilder][FactorGraphBuilder$PostgresFactorGraphBuilder] INFO Starting
For input SQL statements to extractors, instead of an executable.
All data from an extractor is currently written to the relation specified in the output_relation setting. It would be useful to allow extractors to write to multiple relations. One way to implement this would be to allow a _relation key in the JSON output and use that value for grouping.
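For example, an extractor output tuple could then look like this (a hypothetical format illustrating the proposed _relation key; the word field is made up):
{"_relation": "words", "word": "hello"}
Tuples would be grouped by their _relation value and inserted into the corresponding relation.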
When a variable in a weight rule is null, the weight becomes null. This is not intended.
Some rows may have been dropped due to unknown TSV parsing issues.
deepdive_spouse_tsv=# select count(*) from has_spouse_features;
 count
--------
 151808
(1 row)
deepdive_spouse_tsv=# select count(*) from has_spouse;
count
-------
75446
(1 row)
deepdive_spouse_tsv=# select count(*) from people_mentions ;
count
-------
39269
(1 row)
Correct number should be:
deepdive_spouse_plpy=# select count(*) from has_spouse_features;
count
--------
151824
(1 row)
deepdive_spouse_plpy=# select count(*) from has_spouse;
count
-------
75454
(1 row)
deepdive_spouse_plpy=# select count(*) from people_mentions ;
count
-------
39270
(1 row)
(tested on the other two extractors)
Use JDBC rather than bash to execute SQL commands, so that errors can be caught.
Without an inference rule, the system should not do inference at all (extract only). Currently it fails with an error message like:
22:26:52 [inferenceManager] ERROR /afs/cs.stanford.edu/u/zifei/repos/deepdive/out/2014-04-21T222429/graph.weights (No such file or directory)
This needs to be consistent with the ID conventions.
I tried to update the smoke example for the develop branch. I changed the syntax to the current format, but the grounding SQL script failed here:
INSERT INTO dd_graph_variables(id, data_type, initial_value, is_evidence, cardinality)
SELECT people.id, 'Boolean', people.smokes::int, (people.smokes IS NOT NULL), null
FROM people;
DROP TABLE IF EXISTS people_smokes_cardinality CASCADE;
CREATE TABLE people_smokes_cardinality(people_smokes_cardinality) AS VALUES (1) WITH DATA;
INSERT INTO dd_graph_variables(id, data_type, initial_value, is_evidence, cardinality)
SELECT people.id, 'Boolean', people.has_cancer::int, (people.has_cancer IS NOT NULL), null
FROM people;
DROP TABLE IF EXISTS people_has_cancer_cardinality CASCADE;
CREATE TABLE people_has_cancer_cardinality(people_has_cancer_cardinality) AS VALUES (1) WITH DATA;
INSERT INTO dd_graph_variables_map(variable_id)
SELECT id FROM dd_graph_variables;
INSERT INTO dd_graph_variables_holdout(variable_id)
SELECT id FROM dd_graph_variables
WHERE RANDOM() < 0.0 AND is_evidence = true;
UPDATE dd_graph_variables SET is_evidence=false
WHERE dd_graph_variables.id IN (SELECT variable_id FROM dd_graph_variables_holdout);
The error is:
21:53:48.558 [default-dispatcher-2][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore(akka://deepdive)][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore] INFO Executing grounding query...
21:53:57.533 [][][StatementExecutor$$anon$1] ERROR SQL execution failed (Reason: ERROR: duplicate key violates unique constraint "dd_graph_variables_pkey" (seg20 rulk.stanford.edu:40000 pid=25436)):
DROP TABLE IF EXISTS people_smokes_cardinality CASCADE;CREATE TABLE people_smokes_cardinality(people_smokes_cardinality) AS VALUES (1) WITH DATA;INSERT INTO dd_graph_variables(id, data_type, initial_value, is_evidence, cardinality) SELECT people.id, 'Boolean', people.has_cancer::int, (people.has_cancer IS NOT NULL), null FROM people
21:53:57.558 [default-dispatcher-2][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore(akka://deepdive)][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore] ERROR org.postgresql.util.PSQLException: ERROR: duplicate key violates unique constraint "dd_graph_variables_pkey" (seg20 rulk.stanford.edu:40000 pid=25436)
21:53:57.559 [default-dispatcher-2][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore(akka://deepdive)][PostgresInferenceDataStoreComponent$PostgresInferenceDataStore] INFO [Error] Please check the SQL cmd!
21:53:57.644 [default-dispatcher-5][inferenceManager][OneForOneStrategy] ERROR ERROR: duplicate key violates unique constraint "dd_graph_variables_pkey" (seg20 rulk.stanford.edu:40000 pid=25436)
org.postgresql.util.PSQLException: ERROR: duplicate key violates unique constraint "dd_graph_variables_pkey" (seg20 rulk.stanford.edu:40000 pid=25436)
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2157) ~[postgresql-9.2-1003-jdbc4.jar:na]
What seems to cause the error is that the variables "smokes" and "has_cancer" are in the same table; the system tries to use the row ID as the variable ID, which fails because variables cannot have duplicate IDs...
Any suggestions?
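One possible direction, purely as an illustration (not a confirmed fix): mint a distinct ID per variable column instead of reusing the row ID, e.g. by offsetting the second column's IDs past the first column's range:
-- Hypothetical sketch: shift the has_cancer variable IDs past the
-- largest smokes variable ID so the two columns cannot collide.
INSERT INTO dd_graph_variables(id, data_type, initial_value, is_evidence, cardinality)
SELECT people.id + (SELECT max(id) + 1 FROM people), 'Boolean',
       people.has_cancer::int, (people.has_cancer IS NOT NULL), null
FROM people;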
There are potential errors in the new function executeSql in src/main/scala/org/deepdive/extraction/ExtractorRunner.scala. Is the last commit well-tested? @senwu
I set sql: "select * from articles limit 10;" and style: "sql_extractor" in an extractor, and the error looks like this:
22:04:29 [PostgresExtractionDataStore(akka://deepdive)] ERROR org.postgresql.util.PSQLException: A result was returned when none was expected.
22:04:29 [PostgresExtractionDataStore(akka://deepdive)] INFO [Error] Please check the SQL cmd!
22:04:29 [extractorRunner-ext_test_sql] ERROR A result was returned when none was expected.
org.postgresql.util.PSQLException: A result was returned when none was expected.
When I tried to build my own code on top of this function, I also got errors like:
21:43:47 [PostgresExtractionDataStore(akka://deepdive)] ERROR org.postgresql.util.PSQLException: No value specified for parameter 1.
@dennybritz: What is the right way to execute a SQL query somewhere other than grounding?
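Judging from the first error, sql_extractor appears to execute its statement expecting no result set, so a bare SELECT fails. Under that assumption, a sketch that writes into a table instead (article_sample is a hypothetical, pre-created table):
ext_test_sql {
  style: "sql_extractor"
  # Assumes article_sample already exists with the same schema as articles.
  sql: "INSERT INTO article_sample SELECT * FROM articles LIMIT 10"
}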
I am getting the following error, although the setup is nearly identical to the deepdive_spouse example, and all the dd factor tables in PostgreSQL are populated (shown below).
The application.conf file is available at https://github.com/tomMulholland/isDB
17:46:26.559 [Thread-23][sampler][Sampler] INFO 17:46:26.559 [main] DEBUG org.dennybritz.sampler.Runner$ - Creating factor graph...
17:46:26.640 [Thread-23][sampler][Sampler] INFO 17:46:26.639 [main] DEBUG org.dennybritz.sampler.Runner$ - Starting learning phase...
17:46:27.586 [Thread-23][sampler][Sampler] INFO 17:46:27.585 [main] DEBUG org.dennybritz.sampler.Learner - num_iterations=120
17:46:27.587 [Thread-23][sampler][Sampler] INFO 17:46:27.585 [main] DEBUG org.dennybritz.sampler.Learner - num_samples_per_iteration=1
17:46:27.587 [Thread-23][sampler][Sampler] INFO 17:46:27.586 [main] DEBUG org.dennybritz.sampler.Learner - learning_rate=0.1
17:46:27.588 [Thread-23][sampler][Sampler] INFO 17:46:27.587 [main] DEBUG org.dennybritz.sampler.Learner - diminish_rate=0.95
17:46:27.588 [Thread-23][sampler][Sampler] INFO 17:46:27.587 [main] DEBUG org.dennybritz.sampler.Learner - regularization_constant=0.01
17:46:27.589 [Thread-23][sampler][Sampler] INFO 17:46:27.587 [main] DEBUG org.dennybritz.sampler.Learner - num_factors=267260 num_query_factors=75456
17:46:27.590 [Thread-23][sampler][Sampler] INFO 17:46:27.587 [main] DEBUG org.dennybritz.sampler.Learner - num_weights=143009 num_query_weights=49011
17:46:27.590 [Thread-23][sampler][Sampler] INFO 17:46:27.587 [main] DEBUG org.dennybritz.sampler.Learner - num_query_variables=1791 num_evidence_variables=1227
17:46:27.751 [Thread-23][sampler][Sampler] INFO 17:46:27.750 [main] DEBUG org.dennybritz.sampler.Learner - iteration=0 learning_rate=0.1
Exception in thread "main" scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown during a parallel computation: java.lang.UnsupportedOperationException: empty.reduceLeft
scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:124)
scala.collection.immutable.List.reduceLeft(List.scala:84)
scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:195)
scala.collection.AbstractTraversable.reduce(Traversable.scala:105)
org.dennybritz.sampler.SamplingUtils$.sampleVariable(SamplingUtils.scala:34)
org.dennybritz.sampler.SamplingUtils$$anonfun$sampleVariables$1.apply$mcVI$sp(SamplingUtils.scala:42)
org.dennybritz.sampler.SamplingUtils$$anonfun$sampleVariables$1.apply(SamplingUtils.scala:42)
org.dennybritz.sampler.SamplingUtils$$anonfun$sampleVariables$1.apply(SamplingUtils.scala:42)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.parallel.immutable.ParHashSet$ParHashSetIterator.foreach(ParHashSet.scala:76)
.
.
.
at scala.collection.parallel.package$$anon$1.alongWith(package.scala:85)
at scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)
at scala.collection.parallel.ParIterableLike$Foreach.mergeThrowables(ParIterableLike.scala:972)
at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)
at scala.collection.parallel.ParIterableLike$Foreach.tryMerge(ParIterableLike.scala:972)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
17:46:28.161 [default-dispatcher-11][inferenceManager][OneForOneStrategy] ERROR sampling failed (see error log for more details)
java.lang.RuntimeException: sampling failed (see error log for more details)
at org.deepdive.inference.Sampler$$anonfun$receive$1.applyOrElse(Sampler.scala:36) ~[classes/:na]
at akka.actor.Actor$class.aroundReceive(Actor.scala:467) ~[akka-actor_2.10-2.3-M2.jar:2.3-M2]
at org.deepdive.inference.Sampler.aroundReceive(Sampler.scala:17) ~[classes/:na]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:491) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
at akka.actor.ActorCell.invoke(ActorCell.scala:462) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
at akka.dispatch.Mailbox.run(Mailbox.scala:219) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:385) [akka-actor_2.10-2.3-M2.jar:2.3-M2]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library.jar:na]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library.jar:na]
17:46:28.164 [default-dispatcher-11][sampler][LocalActorRef] INFO Message [akka.actor.PoisonPill$] from Actor[akka://deepdive/user/inferenceManager#-1596663203] to Actor[akka://deepdive/user/inferenceManager/sampler#-1865594242] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
17:46:28.165 [default-dispatcher-4][inferenceManager][InferenceManager$PostgresInferenceManager] INFO Starting
17:46:28.166 [default-dispatcher-11][factorGraphBuilder][FactorGraphBuilder$PostgresFactorGraphBuilder] INFO Starting
17:46:56.074 [default-dispatcher-4][taskManager][TaskManager] INFO Memory usage: 213/962MB (max: 962MB)
The DeepDive tables are populated:
isDB=# SELECT schemaname,relname,n_live_tup
isDB-# FROM pg_stat_user_tables
isDB-# ORDER BY n_live_tup DESC;
schemaname | relname | n_live_tup
------------+-----------------------------------------+------------
public | schol_features | 267260
public | f_is_schol_features_query | 267260
public | selectedgesfordumpsql_raw | 267260
public | dd_graph_edges | 267260
public | selectfactorsfordumpsql_raw | 267260
public | dd_graph_factors | 267260
public | dd_graph_weights | 143009
public | selectweightsfordumpsql_raw | 143009
public | selectvariablesfordumpsql_raw | 3018
public | dd_graph_variables | 3018
public | scholarships | 3018
public | dd_graph_variables_map | 3018
public | websites | 852
public | schol_int_study | 645
public | financial_aid | 489
public | dd_graph_variables_holdout | 269
public | factornum | 2
public | scholarships_is_scholarship_cardinality | 1
(18 rows)
nfactor
---------
0
267260
(2 rows)
Can you show me an example in application.conf of how to use the new custom holdout query? I really need that. Thanks!
NOTE: This example wrongly refers to variable "id"s and needs rewriting.
@feiranwang @msushkov
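For reference, the custom holdout query from the first issue above would look roughly like this in application.conf (a sketch; I'm assuming it sits under calibration, so adjust the nesting to your config layout, and note the trailing ';' required by the bug reported at the top):
deepdive {
  calibration.holdout_query: "INSERT INTO dd_graph_variables_holdout(variable_id) select id from candidate where docid in (select docid from eval_docs);"
}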
Restore the views that the old system gave users, and enable users to reuse commands like relearn_from and weight_table.
Users should be able to perform only the extractions while skipping grounding, learning, inference, and calibration. Alternatively, a more flexible pipeline mechanism should be supported; a sketch follows.
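A possible shape for such a pipeline mechanism (hypothetical config; the pipeline keys below are a proposal, not an existing setting):
deepdive.pipeline.run: "extraction_only"
deepdive.pipeline.pipelines {
  # Run only the named extractors; grounding, learning, inference,
  # and calibration would be skipped.
  extraction_only: ["wordsExtractor"]
}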
udf_extractor (the default one) is really a bad name, since the plpy/tsv extractors also have a "udf". We will soon rename it to json_extractor.
There is no exception handling for database connection errors, so the program runs forever...
We should do a sanity check on the configuration upon loading. Instead of the application crashing in the middle of execution due to a configuration issue, we should immediately exit if we find an obvious mistake. Some things we can check for:
There are probably more things we can check for.
Currently, users need to interpret the calibration plots manually. It would be great if we could automatically give them recommendations based on the calibration data.
A general version not optimized for Greenplum will be fine.
What are common features for IE applications? Dependency paths, etc.
People can currently write pure SQL extractors by using an empty extractor, but that's a hack. We should have a principled way to allow pure SQL extractors. The difficulty here is the assignment of unique variable IDs; the sketch below makes it concrete.
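A pure SQL extractor would need something like a shared sequence to mint globally unique variable IDs (a hypothetical sketch; dd_variable_id_seq is made up, and words(id, word) is assumed from the earlier wordsExtractor example):
-- A shared sequence yields IDs that are unique across all variable
-- relations, which reusing a per-table row ID cannot guarantee.
CREATE SEQUENCE dd_variable_id_seq;
INSERT INTO words(id, word)
SELECT nextval('dd_variable_id_seq'), w.word
FROM (SELECT unnest(string_to_array(title, ' ')) AS word FROM titles) w;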