propi / rdfrules

RDFRules: Analytical Tool for Rule Mining from RDF Knowledge Graphs

License: GNU General Public License v3.0

Scala 98.73% HTML 0.04% JavaScript 0.06% Less 1.17%

data-analysis data-mining data-science dataset gui knowledge-base knowledge-based-systems knowledge-discovery knowledge-graph postprocessing preprocessing rdf rest-api rule-engine rule-mining rules rules-engine scala semantic-web

rdfrules's Issues

new Thread instead of Future in Workers

Deserialization exception - Invalid type of measure

If JSON-serialized rules contain confidence or some other measure, they cannot be deserialized via Load ruleset. Loading fails with "Deserialization exception - Invalid type of measure".

rules.json
[ { "body": [ { "object": { "type": "variable", "value": "?a" }, "predicate": "<interacts_with>", "subject": { "type": "variable", "value": "?b" } } ], "head": { "object": { "type": "variable", "value": "?b" }, "predicate": "<interacts_with>", "subject": { "type": "variable", "value": "?a" } }, "measures": [ { "name": "BodySize", "value": 11702212 }, { "name": "HeadCoverage", "value": 0.9917442958647477 }, { "name": "Support", "value": 11605602 }, { "name": "HeadSize", "value": 11702212 }, { "name": "Confidence", "value": 0.9917442958647477 } ] } ]

Add indicative progress indicator

It would help if the mine task had an approximate progress indicator: the number of rules processed so far, possibly with an estimate based on the time required to process them.
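A minimal sketch of such an estimate, assuming the miner exposes the number of processed rules, the current queue size and the elapsed time. The object and method names are hypothetical and not part of RDFRules. The estimate is rough because the queue keeps changing during mining.

```scala
object MiningProgress {
  /** Estimated remaining time in seconds, or None before any rule has been processed.
    * Assumption: queued rules take, on average, as long as the rules processed so far. */
  def estimateRemainingSeconds(processed: Long, queued: Long, elapsedSeconds: Double): Option[Double] =
    if (processed <= 0) None
    else Some(elapsedSeconds / processed * queued)
}
```

The result could be appended to the periodic "processed rules" log message or shown in the GUI.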

Add REST and web UI

State default values of parameters in GUI Mine node

Some thresholds in the Mine node take effect even when they are not set, because defaults apply.
This affects, e.g., "Min head size", which has a default of 100. The default values in effect should be communicated to the user.

Discretization during mining

Add Java interfaces

In HTTP module for task status request return only current logs (not all old logs)

This is a problem with the default akka response limit when there are lots of logs.
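One way to return only the current logs is to keep a cursor per task and hand out just the messages added since the previous status request. The following is a sketch under that assumption; `TaskLog` is a hypothetical name, not a class of the HTTP module.

```scala
import scala.collection.mutable.ArrayBuffer

/** Hypothetical log buffer: each status request gets only the messages added since the last one. */
final class TaskLog {
  private val messages = ArrayBuffer.empty[String]
  private var delivered = 0 // number of messages already returned to the client

  def add(message: String): Unit = synchronized { messages += message }

  /** Returns the not-yet-delivered messages and advances the cursor. */
  def takeNew(): Vector[String] = synchronized {
    val fresh = messages.iterator.drop(delivered).toVector
    delivered = messages.size
    fresh
  }
}
```

With this shape the response size depends on the polling interval rather than on the total length of the task log.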

Help hints in GUI

Add hints to parameters and operations

In GUI cache into memory

We need to resolve the lifetime (or idle time) of an index in memory, or add a limit to the settings: max idle time for an index.

save an index
remove the index
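A max idle time could be enforced roughly as follows. This is only a sketch: `IdleCache` is a hypothetical class, and the current time is passed in explicitly (in practice it would come from `System.currentTimeMillis()`) so that the behaviour is easy to test.

```scala
import scala.collection.mutable

/** Hypothetical in-memory store that drops entries not accessed for more than maxIdleMillis. */
final class IdleCache[K, V](maxIdleMillis: Long) {
  private val entries = mutable.Map.empty[K, (V, Long)] // value and time of last access

  def put(key: K, value: V, now: Long): Unit = synchronized {
    entries.update(key, (value, now))
  }

  def get(key: K, now: Long): Option[V] = synchronized {
    evictIdle(now)
    entries.get(key).map { case (value, _) =>
      entries.update(key, (value, now)) // an access resets the idle timer
      value
    }
  }

  /** Removes every entry whose last access is older than the allowed idle time. */
  def evictIdle(now: Long): Unit = synchronized {
    val stale = entries.iterator.collect {
      case (key, (_, lastAccess)) if now - lastAccess > maxIdleMillis => key
    }.toList
    stale.foreach(key => entries.remove(key))
  }
}
```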

Schema support

It should be possible to attach a schema to a dataset. Then we can do some extended operations:

generate triples with types from ontology (domain, range)
working with sub/super types; add new triples with super types
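The two operations above can be sketched on a simplified triple model. Everything here is an assumption made for illustration: the `Triple` case class, the map-based schema representation and the function names are hypothetical, and the resources in the usage example are made-up sample values.

```scala
object SchemaInference {
  final case class Triple(s: String, p: String, o: String)

  val RdfType = "rdf:type"

  /** Type triples implied by the domain (for subjects) and range (for objects) of each predicate. */
  def typesFromDomainRange(data: Set[Triple],
                           domain: Map[String, String],
                           range: Map[String, String]): Set[Triple] =
    data.flatMap { t =>
      domain.get(t.p).map(c => Triple(t.s, RdfType, c)).toList ++
        range.get(t.p).map(c => Triple(t.o, RdfType, c)).toList
    }

  /** Adds a type triple for every transitive super type; superOf maps a class to its direct super types. */
  def withSuperTypes(triples: Set[Triple], superOf: Map[String, Set[String]]): Set[Triple] = {
    // collects all classes reachable from c; `seen` guards against cycles in the hierarchy
    def ancestors(c: String, seen: Set[String]): Set[String] =
      superOf.getOrElse(c, Set.empty[String]).diff(seen)
        .foldLeft(seen)((acc, sup) => ancestors(sup, acc + sup))
    triples.flatMap { t =>
      if (t.p == RdfType) ancestors(t.o, Set.empty).map(sup => Triple(t.s, RdfType, sup)) + t
      else Set(t)
    }
  }
}
```

For example, with the domain of `dbo:artist` set to `dbo:Album` and its range to `dbo:Band`, one `dbo:artist` triple yields two type triples, and a hierarchy `dbo:Band` under `dbo:Organisation` under `dbo:Agent` adds two more for the object.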

Show all facts covered by a rule

Add perfect rules constraint

After mining the cache method on the Ruleset may save rules into a new collection again

If the ruleset already has an IndexedSeq as its collection, caching should not save the rules into a new collection again. Check whether it behaves this way.

Instantiation does not work properly

Sometimes the results contain different predicates. A task that produces wrong results:

[
  {
    "name": "LoadGraph",
    "parameters": {
      "path": "/dbpedia_yago/mappingbased_objects_sample.ttl",
      "graphName": "<dbpedia>"
    }
  },
  {
    "name": "LoadGraph",
    "parameters": {
      "path": "/dbpedia_yago/yagoFacts.tsv",
      "graphName": "<yago>"
    }
  },
  {
    "name": "LoadGraph",
    "parameters": {
      "path": "/dbpedia_yago/yagoDBpediaInstances.tsv",
      "graphName": "<dbpedia>"
    }
  },
  {
    "name": "MergeDatasets",
    "parameters": {}
  },
  {
    "name": "AddPrefixes",
    "parameters": {
      "prefixes": [
        {
          "prefix": "dbo",
          "nameSpace": "http://dbpedia.org/ontology/"
        },
        {
          "prefix": "dbr",
          "nameSpace": "http://dbpedia.org/resource/"
        }
      ]
    }
  },
  {
    "name": "Index",
    "parameters": {
      "prefixedUris": true
    }
  },
  {
    "name": "Mine",
    "parameters": {
      "thresholds": [
        {
          "name": "TopK",
          "value": 1000
        },
        {
          "name": "MinHeadCoverage",
          "value": 0.01
        }
      ],
      "patterns": [],
      "constraints": [
        {
          "name": "WithoutConstants"
        }
      ]
    }
  },
  {
    "name": "CacheRuleset",
    "parameters": {
      "inMemory": true,
      "path": "e4790ffb-d535-4e14-9478-867d3f4abe2a",
      "revalidate": false
    }
  },
  {
    "name": "ComputePcaConfidence",
    "parameters": {
      "min": 0.5,
      "topk": 50
    }
  },
  {
    "name": "Sorted",
    "parameters": {}
  },
  {
    "name": "GraphBasedRules",
    "parameters": {}
  },
  {
    "name": "Instantiate",
    "parameters": {
      "rule": {
        "body": [
          {
            "graphs": [
              "<dbpedia>"
            ],
            "object": {
              "type": "variable",
              "value": "?c"
            },
            "predicate": {
              "localName": "album",
              "nameSpace": "http://dbpedia.org/ontology/",
              "prefix": "dbo"
            },
            "subject": {
              "type": "variable",
              "value": "?a"
            }
          },
          {
            "graphs": [
              "<yago>"
            ],
            "object": {
              "type": "variable",
              "value": "?c"
            },
            "predicate": "<created>",
            "subject": {
              "type": "variable",
              "value": "?b"
            }
          }
        ],
        "head": {
          "graphs": [
            "<dbpedia>"
          ],
          "object": {
            "type": "variable",
            "value": "?b"
          },
          "predicate": {
            "localName": "musicalBand",
            "nameSpace": "http://dbpedia.org/ontology/",
            "prefix": "dbo"
          },
          "subject": {
            "type": "variable",
            "value": "?a"
          }
        },
        "measures": [
          {
            "name": "HeadCoverage",
            "value": 0.4664823773324119
          },
          {
            "name": "HeadSize",
            "value": 2894
          },
          {
            "name": "PcaBodySize",
            "value": 1368
          },
          {
            "name": "Support",
            "value": 1350
          },
          {
            "name": "PcaConfidence",
            "value": 0.9868421052631579
          }
        ]
      },
      "part": "Whole"
    }
  },
  {
    "name": "GetRules",
    "parameters": {}
  }
]

In GUI and REST add revalidate checkbox

By default the revalidate checkbox is unchecked: once the cache has been created within the workflow, the next run loads the result from the cache and all preceding operations are skipped. If revalidate is checked, all previous operations are performed and the cache is created again.
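The intended semantics can be sketched as a small wrapper around the preceding operations. `StepCache` and `getOrRun` are hypothetical names used only to illustrate the behaviour of the flag; the actual cache operations in RDFRules may be structured differently.

```scala
import scala.collection.mutable

/** Hypothetical cache step: skips the preceding operations on a hit unless revalidate is set. */
final class StepCache[T] {
  private val store = mutable.Map.empty[String, T]

  def getOrRun(key: String, revalidate: Boolean)(previousOps: => T): T = synchronized {
    store.get(key) match {
      case Some(hit) if !revalidate => hit // cache hit: previousOps is never evaluated
      case _ =>
        val result = previousOps // run all preceding operations
        store.update(key, result) // create or overwrite the cache
        result
    }
  }
}
```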

Set limitations for workspace (during upload)

Set immutable and mutable folders, and set a time limit for uploaded files (e.g. max one week)... Set a memory limit for the app and restart the HTTP service if the memory overflows. Show the current state of the memory in the GUI...

Predict triples operation should have a choice to predict only missing triples

Currently the result contains all predicted triples, including triples that are already in the dataset.
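The requested option amounts to filtering the predictions against the dataset. A minimal sketch on a simplified, hypothetical `Triple` type:

```scala
object PredictionFilter {
  final case class Triple(s: String, p: String, o: String)

  /** Keeps only predicted triples that are not already present in the dataset. */
  def onlyMissing(predicted: Iterable[Triple], dataset: Set[Triple]): Vector[Triple] =
    predicted.iterator.filterNot(dataset.contains).toVector
}
```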

Continuously save the mined rules to disk during mining

This is needed to keep some results if the mining crashes, or to save memory.

Add constraint to filter rules that do not improve the confidence w.r.t their parents

A strategy that avoids outputting rules that do not improve the confidence w.r.t. their parents. These rules should still be involved in refinement but should not appear in the output. Maybe this functionality should be added to Ruleset.
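As a post-processing filter on a ruleset, the idea might look like the sketch below. It uses a simplified, hypothetical `Rule` model, and it assumes that a parent is any rule with the same head whose body is a proper subset of the child's body. The miner's own notion of a parent may differ.

```scala
object ParentConfidenceFilter {
  /** Simplified rule: body atoms and head as plain strings (an assumption for this sketch). */
  final case class Rule(body: Set[String], head: String, confidence: Double)

  /** Assumed parent relation: same head and a proper subset of the body atoms. */
  def isParent(parent: Rule, child: Rule): Boolean =
    parent.head == child.head &&
      parent.body.size < child.body.size &&
      parent.body.subsetOf(child.body)

  /** Keeps a rule only if its confidence is strictly higher than that of each of its parents. */
  def improving(rules: Seq[Rule]): Seq[Rule] =
    rules.filter(r => rules.forall(p => !isParent(p, r) || r.confidence > p.confidence))
}
```

Because the filter runs on the finished ruleset, the dropped rules still take part in refinement, as the issue requires.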

Check inUseInMemory index mode

Filter ruleset - only max or closure rules

Better logging

Do not show the same message more than once. The dataset loading log is wrong because it does not take into account multiple graphs being merged into one dataset. Work out how to disable dataset loading messages during indexing, since the same messages are shown for both dataset and index loading, which is very annoying:

2020-09-08T16:11:28.606Z : Action Dataset indexing, steps: 0 -- started
2020-09-08T16:11:28.612Z : Action Dataset loading, steps: 0 -- started
2020-09-08T16:11:28.612Z : Action Dataset indexing, steps: 0 -- started
2020-09-08T16:11:30.862Z : Action Dataset loading, steps: 14465 -- ended
2020-09-08T16:11:30.862Z : Action Dataset indexing, steps: 14465
2020-09-08T16:11:30.862Z : Action Dataset loading, steps: 0 -- started
2020-09-08T16:11:30.862Z : Action Dataset indexing, steps: 14465
2020-09-08T16:11:32.434Z : Action Dataset loading, steps: 18845 -- ended
2020-09-08T16:11:32.434Z : Action Dataset indexing, steps: 33310
2020-09-08T16:11:32.435Z : Action Dataset loading, steps: 0 -- started
2020-09-08T16:11:32.435Z : Action Dataset indexing, steps: 33310
2020-09-08T16:11:37.436Z : Action Dataset loading, steps: 20205
2020-09-08T16:11:37.436Z : Action Dataset indexing, steps: 53516
2020-09-08T16:11:42.437Z : Action Dataset loading, steps: 52228
2020-09-08T16:11:42.437Z : Action Dataset indexing, steps: 85539
2020-09-08T16:11:47.438Z : Action Dataset loading, steps: 81463
2020-09-08T16:11:47.438Z : Action Dataset indexing, steps: 114774
2020-09-08T16:11:52.440Z : Action Dataset loading, steps: 112571
2020-09-08T16:11:52.443Z : Action Dataset indexing, steps: 145882
2020-09-08T16:11:53.711Z : Action Dataset loading, steps: 121437 -- ended
2020-09-08T16:11:53.712Z : Action Dataset indexing, steps: 154747
2020-09-08T16:11:53.765Z : Action Dataset indexing, steps: 154747 -- ended
2020-09-08T16:11:53.766Z : Action SameAs resolving, steps: 0 -- started
2020-09-08T16:11:54.195Z : Predicates trimming.
2020-09-08T16:11:54.195Z : Action SameAs resolving, steps: 0 -- ended
2020-09-08T16:11:54.318Z : Action Subjects indexing, steps: 0 -- started
2020-09-08T16:11:54.878Z : Subjects trimming.
2020-09-08T16:11:54.878Z : Action Subjects indexing, steps: 140281 -- ended
2020-09-08T16:11:54.948Z : Action Objects indexing, steps: 0 -- started
2020-09-08T16:11:55.341Z : Objects trimming.
2020-09-08T16:11:55.341Z : Action Objects indexing, steps: 140281 -- ended
2020-09-08T16:11:55.407Z : Action Amie rules mining, steps: 0 -- started
2020-09-08T16:11:55.407Z : Action Browsed projections large buckets, steps: 0 -- started
2020-09-08T16:11:55.407Z : Action Amie rules mining, steps: 0 -- started
2020-09-08T16:12:00.423Z : Action Browsed projections large buckets, steps: 0 -- started
2020-09-08T16:12:00.425Z : Action Amie rules mining, steps: 3500 -- processed rules, found closed rules: 1086, queue size: 7609
2020-09-08T16:12:05.436Z : Action Browsed projections large buckets, steps: 0 -- started
2020-09-08T16:12:05.436Z : Action Amie rules mining, steps: 9053 -- processed rules, found closed rules: 2032, queue size: 1631
2020-09-08T16:12:08.207Z : Action Browsed projections large buckets, steps: 0 -- ended
2020-09-08T16:12:08.207Z : Action Amie rules mining, steps: 10206 -- processed rules, found closed rules: 2242, queue size: 0
2020-09-08T16:12:08.208Z : Action Amie rules mining, steps: 10206 -- ended
2020-09-08T16:12:08.261Z : Action PCA Confidence computing, steps: 0 of 1000, progress: 0.0% -- started
2020-09-08T16:12:09.343Z : Action PCA Confidence computing, steps: 1000 of 1000, progress: 100.0% -- ended
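One possible way to suppress the repeats seen in the log above is to remember the last message kept for each action and drop a message that is identical to it. This is a sketch only: it assumes messages are compared without their timestamps and that the action name is the text before the first comma.

```scala
object LogDedup {
  /** Keeps a message only if it differs from the last message kept for the same action. */
  def dropRepeats(messages: Seq[String]): Vector[String] = {
    val (kept, _) = messages.foldLeft((Vector.empty[String], Map.empty[String, String])) {
      case ((out, last), msg) =>
        val action = msg.takeWhile(_ != ',') // e.g. "Action Dataset indexing"
        if (last.get(action).contains(msg)) (out, last)
        else (out :+ msg, last.updated(action, msg))
    }
    kept
  }
}
```

Applied to the excerpt, the second "Action Dataset indexing, steps: 14465" line would be dropped while the interleaved loading messages are kept.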

Loading a task in GUI has ambiguous behaviour

For example, CacheDataset is both an action and a transformation. During loading it is impossible to distinguish an action from a transformation with the same name.

Empty cache file can cause pipeline to fail without log notices

This behaviour can be reproduced as follows:

load the attached task.json
create an empty file "rulesPCA" in the workspace
run the pipeline

What will happen:
Indexing will not start. The log messages shown are:

2020-09-11 14:55:25:461 +0200 [rdfrules-http-akka.actor.default-dispatcher-6] INFO com.github.propi.rdfrules.http.InMemoryCache - Some value with key '025495e4-8e84-4f30-bfec-325a18dd3499x' was pushed into the memory cache. Number of items in the cache is: 1
2020-09-11 14:56:16:359 +0200 [Thread-1] INFO task-7216ee1d-9a9b-4286-bad1-0425e3c6b6fc - Predicates trimming.
2020-09-11 14:56:16:363 +0200 [rdfrules-http-akka.actor.default-dispatcher-9] INFO akka.actor.LocalActorRef - Message [com.github.propi.rdfrules.http.service.Task$TaskRequest$AddMsg] to Actor[akka://rdfrules-http/user/task-service/task-7216ee1d-9a9b-4286-bad1-0425e3c6b6fc#-1209497236] was not delivered. [1] dead letters encountered. If this is not an expected behavior then Actor[akka://rdfrules-http/user/task-service/task-7216ee1d-9a9b-4286-bad1-0425e3c6b6fc#-1209497236] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

A workaround is either to delete the empty cache file "rulesPCA" or to set the last Cache node in the pipeline to "revalidate" the cache.

task.json.zip
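A guard of the following kind could treat an empty cache file as a cache miss instead of letting the pipeline fail silently. This is a sketch: `CacheFileCheck` is a hypothetical helper, and only standard `java.nio.file` calls are used.

```scala
import java.nio.file.{Files, Path}

object CacheFileCheck {
  /** True only if the cache file exists as a regular file and is not empty. */
  def isUsable(path: Path): Boolean =
    Files.isRegularFile(path) && Files.size(path) > 0
}
```

When `isUsable` returns false, the cache step could log a warning and fall back to recomputing, which matches the "revalidate" workaround described above.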

Turn off minimal confidence limit in GUI 0.01

GUI - sort by metrics and metrics consistent ordering in the metric list

CSV support

Error if the workspace is empty

In the GUI, an empty workspace causes a JavaScript error.

Index object should have useMapper same as Resultset

useMapper[T](mapper => index => T)

Comparison of the current method with a new proposal

we need not specify a dangling variable
first count the support for all dangling atoms without instances
if the support is greater than the threshold, count instances for the danglings
instead of specifyAtom, create a function specifyVariable

Memory estimation by dataset size and create limits

Estimate the memory needed to store a dataset in the index. Set limits, e.g., max 1GB = N quads...
There can be an upper limit in combination with System.gc: once we get close to the limit, we stop loading the index.

Other restrictions in the settings:

max quads
max triple item size
max memory during indexing (or at all)
max mined rules
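A simple memory check during index loading could be based on the JVM's own counters. This sketch uses only `java.lang.Runtime`. The 0.9 default is an arbitrary assumption, and `MemoryGuard` is a hypothetical name.

```scala
object MemoryGuard {
  /** Fraction of the maximum heap currently in use, according to the JVM. */
  def usedFraction(): Double = {
    val rt = Runtime.getRuntime
    (rt.totalMemory() - rt.freeMemory()).toDouble / rt.maxMemory().toDouble
  }

  /** Loading should stop once the used fraction reaches the limit (0.9 is an assumed default). */
  def shouldStopLoading(used: Double, limit: Double = 0.9): Boolean = used >= limit
}
```

The loader would call `shouldStopLoading(MemoryGuard.usedFraction())` periodically, for example after every batch of quads, possibly after a `System.gc()` hint as the issue suggests.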

Slow actor debugger

LinkedBlockingQueue is the bottleneck. Try to implement a "non-blocking" debugger: one message with a counter instead of a queue. The reporting thread can simply sleep for 5 seconds and then read the current message.
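The proposal can be sketched with two atomic fields instead of a queue. `ProgressState` is a hypothetical name. Workers only overwrite the latest message and bump a counter, so they never block on each other.

```scala
import java.util.concurrent.atomic.{AtomicLong, AtomicReference}

/** Hypothetical shared state: a step counter and the most recent message, with no queue. */
final class ProgressState {
  private val steps = new AtomicLong(0L)
  private val lastMessage = new AtomicReference[String]("")

  /** Called by workers; lock-free, so a slow reader never holds them up. */
  def step(message: String): Unit = {
    steps.incrementAndGet()
    lastMessage.set(message)
  }

  /** Called by the reporting thread, e.g. once every 5 seconds. */
  def snapshot(): (Long, String) = (steps.get(), lastMessage.get())
}
```

The two values in a snapshot are read separately, so the message may be slightly newer than the counter. For periodic progress logging that inaccuracy should be acceptable.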

Mine returns at most 10,000 rules

It seems that the output of Mine is capped at 10,000 rules (if Top-K is not used). If Top-K is used, any higher value seems to be automatically reset to 10,000.

Search in discovered rules is limited to one atom

It does not appear to be possible to search for a rule based on matches in two or more atoms.
Atoms in rules are separated by a space, but searching for a space works only inside atoms.

instantiation of "body" and "head" return unexpected results

The instantiation of "body" and "head" seems to return the same result as the instantiation of the whole rule.

Better mining debugging

Separate debugging into stages and offer a progress bar based on the queue size for each stage.

add json task command line support

It would be convenient to be able to run RDFRules, e.g., as

java -jar RDFRulesLauncher.jar "task.json"
Where task.json would be generated in the GUI, or modified based on a task.json generated in the GUI.
This could supersede the Java API.

sync with new SBT version

It seems that SBT is not compatible with JDK 13 and 14.
sbt/sbt#5509 ("We don't test sbt on JDK 14, so that could also be the problem. Please run it on JDK 8 or 11.")
If this is true, the documentation should warn about this.
For me, it works with JDK 11.

Also, the run-main command on the RDFRules homepage does not seem to work with the current version of SBT - clulab/eidos#440.
It seems it was replaced by runMain.

Not to involve pruned head triples in the refining phase

Once some head triples are not mapped to the body (they are pruned), we need not involve them in the next refine phase.

If the A_r set is empty, the current binding of the head s, p, o can be omitted in all further refinements of subsequent rules that are based on the current rule.

When RDFRules runs out of memory, worker threads are not terminated

When RDFRules runs out of memory (GC overhead limit exceeded), worker threads are not terminated and the load on all CPU cores remains at 100%.

020-09-16 11:34:59:780 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16664 (0.06 per sec) -- processed rules, found closed rules: 25535936, queue size: 25577603, stage: 2, activeThreads: 6
Exception in thread "Thread-40" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.github.propi.rdfrules.algorithm.amie.RuleFilter$RulePatternFilter$$Lambda$1452/1315182476.get$Lambda(Unknown Source)
    at java.lang.invoke.LambdaForm$DMH/1023714065.invokeStatic_LL_L(LambdaForm$DMH)
    at java.lang.invoke.LambdaForm$MH/1802598046.linkToTargetMethod(LambdaForm$MH)
    at com.github.propi.rdfrules.algorithm.amie.RuleFilter$RulePatternFilter.matchAtom(RuleFilter.scala:83)
    at com.github.propi.rdfrules.algorithm.amie.RuleFilter$RulePatternFilter.apply(RuleFilter.scala:95)
    at com.github.propi.rdfrules.algorithm.amie.RuleFilter$And.apply(RuleFilter.scala:42)
    at com.github.propi.rdfrules.algorithm.amie.RuleRefinement.$anonfun$refine$13(RuleRefinement.scala:203)
    at com.github.propi.rdfrules.algorithm.amie.RuleRefinement$$Lambda$1592/1828223227.apply(Unknown Source)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:448)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:501)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:447)
    at scala.collection.Iterator.foreach(Iterator.scala:929)
    at scala.collection.Iterator.foreach$(Iterator.scala:929)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
    at com.github.propi.rdfrules.algorithm.amie.Amie$AmieProcess$$anon$1.$anonfun$run$11(Amie.scala:209)
    at com.github.propi.rdfrules.algorithm.amie.Amie$AmieProcess$$anon$1.$anonfun$run$11$adapted(Amie.scala:202)
    at com.github.propi.rdfrules.algorithm.amie.Amie$AmieProcess$$anon$1$$Lambda$1422/153380730.apply(Unknown Source)
    at scala.collection.Iterator.foreach(Iterator.scala:929)
    at scala.collection.Iterator.foreach$(Iterator.scala:929)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
    at com.github.propi.rdfrules.algorithm.amie.Amie$AmieProcess$$anon$1.run(Amie.scala:202)
    at java.lang.Thread.run(Thread.java:748)
2020-09-16 11:35:34:200 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16665 (0.04 per sec) -- processed rules, found closed rules: 25538702, queue size: 25580374, stage: 2, activeThreads: 6
2020-09-16 11:36:24:157 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16666 (0.06 per sec) -- processed rules, found closed rules: 25544096, queue size: 25585771, stage: 2, activeThreads: 6
Exception in thread "Thread-44" java.lang.OutOfMemoryError: GC overhead limit exceeded
2020-09-16 11:37:33:113 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16667 (0.06 per sec) -- processed rules, found closed rules: 25546573, queue size: 25588249, stage: 2, activeThreads: 6
Exception in thread "Thread-43" java.lang.OutOfMemoryError: GC overhead limit exceeded
2020-09-16 11:38:00:114 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16668 (0.06 per sec) -- processed rules, found closed rules: 25547331, queue size: 25589007, stage: 2, activeThreads: 6
Exception in thread "Thread-41" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-45" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-33" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-53" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-68" java.lang.OutOfMemoryError: GC overhead limit exceeded
Uncaught error from thread [rdfrules-http-scheduler-1]: GC overhead limit ex
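One possible remedy is a shared stop flag that is set when any worker dies with an uncaught error and that the remaining workers check in their loops. The sketch below illustrates that idea with hypothetical names; it is not how the Amie workers are implemented. With a real `OutOfMemoryError` the handler itself may fail for lack of memory, so this is best-effort only.

```scala
import java.util.concurrent.atomic.AtomicBoolean

object StopOnFailure {
  /** Runs each worker in its own thread; returns true if any worker died with an uncaught error.
    * Each worker receives the shared stop flag and is expected to poll it regularly. */
  def runAll(workers: Seq[AtomicBoolean => Unit]): Boolean = {
    val stop = new AtomicBoolean(false)
    val threads = workers.map { work =>
      val thread = new Thread(() => work(stop))
      thread.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
        // any uncaught error in one worker asks all other workers to stop
        def uncaughtException(t: Thread, e: Throwable): Unit = stop.set(true)
      })
      thread
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    stop.get()
  }
}
```

A mining loop would then take the form `while (!stop.get() && queueNotEmpty) ...`, so the surviving threads exit instead of keeping all cores busy.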

Add graph-based atoms/rules and constraints

p(a, b, Dbpedia) -> p(a, b, Yago)
p(a, b, [Dbpedia, Wikidata]) -> p(a, b, Yago)

add a constraint that enables this behaviour; by default, graph-based rules are not used
the rule pattern for graphs works only if the graph-based mode is turned on
print rule: a parameter for showing graphs in rules

constants at the subject position
constants at the object position
constants at the functional item (C hasCitizen ?a), or (?a isCitizenOf C) - we instantiate the object for functions and the subject for inverse functions because these items should have greater support.
without constants