propi / rdfrules

RDFRules: Analytical Tool for Rule Mining from RDF Knowledge Graphs

License: GNU General Public License v3.0

Scala 98.73% HTML 0.04% JavaScript 0.06% Less 1.17%
data-analysis data-mining data-science dataset gui knowledge-base knowledge-based-systems knowledge-discovery knowledge-graph postprocessing preprocessing rdf rest-api rule-engine rule-mining rules rules-engine scala semantic-web

rdfrules's People

Contributors

propi

Forkers

kizi sandy4321

rdfrules's Issues

Deserialization exception - Invalid type of measure

If JSON-serialized rules contain confidence or another measure, they cannot be deserialized via Load ruleset. The operation fails with "Deserialization exception - Invalid type of measure".

rules.json
[
  {
    "body": [
      {
        "object": { "type": "variable", "value": "?a" },
        "predicate": "<interacts_with>",
        "subject": { "type": "variable", "value": "?b" }
      }
    ],
    "head": {
      "object": { "type": "variable", "value": "?b" },
      "predicate": "<interacts_with>",
      "subject": { "type": "variable", "value": "?a" }
    },
    "measures": [
      { "name": "BodySize", "value": 11702212 },
      { "name": "HeadCoverage", "value": 0.9917442958647477 },
      { "name": "Support", "value": 11605602 },
      { "name": "HeadSize", "value": 11702212 },
      { "name": "Confidence", "value": 0.9917442958647477 }
    ]
  }
]
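
While the deserializer is being fixed, a small check can show which measure names a rules.json file actually contains. The following is only a diagnostic sketch in Python, not RDFRules code: the set of known names is an assumption collected from the examples on this page, and `unknown_measures` is a hypothetical helper.

```python
import json

# Measure names seen in the examples on this page. This is an assumption,
# not the authoritative list accepted by RDFRules.
KNOWN_MEASURES = {
    "BodySize", "HeadCoverage", "HeadSize", "Support",
    "Confidence", "PcaBodySize", "PcaConfidence",
}


def unknown_measures(rules_json):
    """Return (rule index, measure name) pairs whose name is not known."""
    rules = json.loads(rules_json)
    return [
        (index, measure["name"])
        for index, rule in enumerate(rules)
        for measure in rule.get("measures", [])
        if measure["name"] not in KNOWN_MEASURES
    ]
```

If the list comes back empty for a file that still fails to load, the names are probably not the cause and the value types are the next thing to inspect.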

Add indicative progress indicator

It would help if the mine task had an approximate progress indicator: the number of rules processed so far, and possibly a time estimate based on how long the processed rules have taken.
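
A rough estimate of this kind could be derived from the processed-rule count, the elapsed time, and the current queue size. The sketch below is in Python for illustration only; `estimate_progress` is a hypothetical helper, and it assumes the queue does not grow further, which is not guaranteed during mining.

```python
def estimate_progress(processed_rules, queue_size, elapsed_seconds):
    """Return (rules per second, estimated seconds remaining).

    The remaining time is None until at least one rule has been processed.
    It assumes the queue will not grow, so it is only indicative.
    """
    if processed_rules <= 0 or elapsed_seconds <= 0:
        return 0.0, None  # nothing measured yet
    rate = processed_rules / elapsed_seconds
    return rate, queue_size / rate
```

The three inputs correspond to the "steps", "queue size", and timestamp fields that the mining log already prints.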

State default values of parameters in GUI Mine node

Some thresholds in the Mine node take effect even when they are not specified, because their defaults apply.
This affects, for example, "Min head size", which has a default of 100. The default values in effect should be communicated to the user.

In GUI cache into memory

We need to decide on the lifetime (or idle time) of an index held in memory, or add a limit to the settings: max idle time for an index.

  • save an index
  • remove the index
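
One way to picture the "max idle time" setting is a cache that drops entries not accessed within a configured interval. This is a minimal Python sketch, not the RDFRules implementation; `IdleEvictingCache` and its parameters are hypothetical names.

```python
import time


class IdleEvictingCache:
    """Keeps values in memory until they have been idle for too long."""

    def __init__(self, max_idle_seconds, clock=time.monotonic):
        self._max_idle = max_idle_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._entries = {}   # key -> (value, time of last access)

    def put(self, key, value):
        self._entries[key] = (value, self._clock())

    def get(self, key):
        """Return the value and refresh its idle timer, or None if absent."""
        self.evict_idle()
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, _ = entry
        self._entries[key] = (value, self._clock())
        return value

    def evict_idle(self):
        """Remove entries idle for longer than the limit; return their keys."""
        now = self._clock()
        stale = [key for key, (_, last) in self._entries.items()
                 if now - last > self._max_idle]
        for key in stale:
            del self._entries[key]
        return stale
```

Eviction here happens lazily on access. A periodic background call to `evict_idle` would be needed to free memory for indexes that are never requested again.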

Schema support

It should be possible to attach a schema to a dataset. Then we can do some extended operations:

  • generate triples with types from ontology (domain, range)
  • work with sub/super types; add new triples with super types

Instantiation does not work properly

Sometimes the results contain different predicates. A task that produces wrong results:

[
  {
    "name": "LoadGraph",
    "parameters": {
      "path": "/dbpedia_yago/mappingbased_objects_sample.ttl",
      "graphName": "<dbpedia>"
    }
  },
  {
    "name": "LoadGraph",
    "parameters": {
      "path": "/dbpedia_yago/yagoFacts.tsv",
      "graphName": "<yago>"
    }
  },
  {
    "name": "LoadGraph",
    "parameters": {
      "path": "/dbpedia_yago/yagoDBpediaInstances.tsv",
      "graphName": "<dbpedia>"
    }
  },
  {
    "name": "MergeDatasets",
    "parameters": {}
  },
  {
    "name": "AddPrefixes",
    "parameters": {
      "prefixes": [
        {
          "prefix": "dbo",
          "nameSpace": "http://dbpedia.org/ontology/"
        },
        {
          "prefix": "dbr",
          "nameSpace": "http://dbpedia.org/resource/"
        }
      ]
    }
  },
  {
    "name": "Index",
    "parameters": {
      "prefixedUris": true
    }
  },
  {
    "name": "Mine",
    "parameters": {
      "thresholds": [
        {
          "name": "TopK",
          "value": 1000
        },
        {
          "name": "MinHeadCoverage",
          "value": 0.01
        }
      ],
      "patterns": [],
      "constraints": [
        {
          "name": "WithoutConstants"
        }
      ]
    }
  },
  {
    "name": "CacheRuleset",
    "parameters": {
      "inMemory": true,
      "path": "e4790ffb-d535-4e14-9478-867d3f4abe2a",
      "revalidate": false
    }
  },
  {
    "name": "ComputePcaConfidence",
    "parameters": {
      "min": 0.5,
      "topk": 50
    }
  },
  {
    "name": "Sorted",
    "parameters": {}
  },
  {
    "name": "GraphBasedRules",
    "parameters": {}
  },
  {
    "name": "Instantiate",
    "parameters": {
      "rule": {
        "body": [
          {
            "graphs": [
              "<dbpedia>"
            ],
            "object": {
              "type": "variable",
              "value": "?c"
            },
            "predicate": {
              "localName": "album",
              "nameSpace": "http://dbpedia.org/ontology/",
              "prefix": "dbo"
            },
            "subject": {
              "type": "variable",
              "value": "?a"
            }
          },
          {
            "graphs": [
              "<yago>"
            ],
            "object": {
              "type": "variable",
              "value": "?c"
            },
            "predicate": "<created>",
            "subject": {
              "type": "variable",
              "value": "?b"
            }
          }
        ],
        "head": {
          "graphs": [
            "<dbpedia>"
          ],
          "object": {
            "type": "variable",
            "value": "?b"
          },
          "predicate": {
            "localName": "musicalBand",
            "nameSpace": "http://dbpedia.org/ontology/",
            "prefix": "dbo"
          },
          "subject": {
            "type": "variable",
            "value": "?a"
          }
        },
        "measures": [
          {
            "name": "HeadCoverage",
            "value": 0.4664823773324119
          },
          {
            "name": "HeadSize",
            "value": 2894
          },
          {
            "name": "PcaBodySize",
            "value": 1368
          },
          {
            "name": "Support",
            "value": 1350
          },
          {
            "name": "PcaConfidence",
            "value": 0.9868421052631579
          }
        ]
      },
      "part": "Whole"
    }
  },
  {
    "name": "GetRules",
    "parameters": {}
  }
]

In GUI and REST add revalidate checkbox

By default, the revalidate checkbox is unchecked. Once the cache has been used within the workflow, the next run loads the result from the cache and skips all preceding operations. If revalidate is checked, all preceding operations are performed and the cache is created again.
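
The intended semantics can be sketched as follows. This is an illustrative Python sketch with hypothetical names (`run_with_cache`, `compute`), where a plain dictionary stands in for the cache and `compute` stands in for all operations that precede the cache node.

```python
def run_with_cache(cache, key, compute, revalidate=False):
    """Return the cached value, or run the preceding operations.

    cache:      dict-like store standing in for the real cache
    compute:    callable representing all operations before the cache node
    revalidate: when True, ignore any cached value and rebuild it
    """
    if not revalidate and key in cache:
        return cache[key]  # preceding operations are skipped
    value = compute()
    cache[key] = value     # create (or recreate) the cache entry
    return value
```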

Set limitations for workspace (during upload)

Set immutable and mutable folders, and set time restrictions for uploaded files (e.g. keep them for at most one week). Set a memory limit for the app and restart the HTTP service if memory overflows. Show the current state of the memory in the GUI.

Better logging

Do not show the same message multiple times. Logging of dataset loading is wrong because it does not account for multiple graphs being merged into one dataset. Decide how to disable dataset-loading messages during indexing, since the log currently shows the same annoying messages for both dataset loading and index loading:

2020-09-08T16:11:28.606Z : Action Dataset indexing, steps: 0 -- started
2020-09-08T16:11:28.612Z : Action Dataset loading, steps: 0 -- started
2020-09-08T16:11:28.612Z : Action Dataset indexing, steps: 0 -- started
2020-09-08T16:11:30.862Z : Action Dataset loading, steps: 14465 -- ended
2020-09-08T16:11:30.862Z : Action Dataset indexing, steps: 14465
2020-09-08T16:11:30.862Z : Action Dataset loading, steps: 0 -- started
2020-09-08T16:11:30.862Z : Action Dataset indexing, steps: 14465
2020-09-08T16:11:32.434Z : Action Dataset loading, steps: 18845 -- ended
2020-09-08T16:11:32.434Z : Action Dataset indexing, steps: 33310
2020-09-08T16:11:32.435Z : Action Dataset loading, steps: 0 -- started
2020-09-08T16:11:32.435Z : Action Dataset indexing, steps: 33310
2020-09-08T16:11:37.436Z : Action Dataset loading, steps: 20205
2020-09-08T16:11:37.436Z : Action Dataset indexing, steps: 53516
2020-09-08T16:11:42.437Z : Action Dataset loading, steps: 52228
2020-09-08T16:11:42.437Z : Action Dataset indexing, steps: 85539
2020-09-08T16:11:47.438Z : Action Dataset loading, steps: 81463
2020-09-08T16:11:47.438Z : Action Dataset indexing, steps: 114774
2020-09-08T16:11:52.440Z : Action Dataset loading, steps: 112571
2020-09-08T16:11:52.443Z : Action Dataset indexing, steps: 145882
2020-09-08T16:11:53.711Z : Action Dataset loading, steps: 121437 -- ended
2020-09-08T16:11:53.712Z : Action Dataset indexing, steps: 154747
2020-09-08T16:11:53.765Z : Action Dataset indexing, steps: 154747 -- ended
2020-09-08T16:11:53.766Z : Action SameAs resolving, steps: 0 -- started
2020-09-08T16:11:54.195Z : Predicates trimming.
2020-09-08T16:11:54.195Z : Action SameAs resolving, steps: 0 -- ended
2020-09-08T16:11:54.318Z : Action Subjects indexing, steps: 0 -- started
2020-09-08T16:11:54.878Z : Subjects trimming.
2020-09-08T16:11:54.878Z : Action Subjects indexing, steps: 140281 -- ended
2020-09-08T16:11:54.948Z : Action Objects indexing, steps: 0 -- started
2020-09-08T16:11:55.341Z : Objects trimming.
2020-09-08T16:11:55.341Z : Action Objects indexing, steps: 140281 -- ended
2020-09-08T16:11:55.407Z : Action Amie rules mining, steps: 0 -- started
2020-09-08T16:11:55.407Z : Action Browsed projections large buckets, steps: 0 -- started
2020-09-08T16:11:55.407Z : Action Amie rules mining, steps: 0 -- started
2020-09-08T16:12:00.423Z : Action Browsed projections large buckets, steps: 0 -- started
2020-09-08T16:12:00.425Z : Action Amie rules mining, steps: 3500 -- processed rules, found closed rules: 1086, queue size: 7609
2020-09-08T16:12:05.436Z : Action Browsed projections large buckets, steps: 0 -- started
2020-09-08T16:12:05.436Z : Action Amie rules mining, steps: 9053 -- processed rules, found closed rules: 2032, queue size: 1631
2020-09-08T16:12:08.207Z : Action Browsed projections large buckets, steps: 0 -- ended
2020-09-08T16:12:08.207Z : Action Amie rules mining, steps: 10206 -- processed rules, found closed rules: 2242, queue size: 0
2020-09-08T16:12:08.208Z : Action Amie rules mining, steps: 10206 -- ended
2020-09-08T16:12:08.261Z : Action PCA Confidence computing, steps: 0 of 1000, progress: 0.0% -- started
2020-09-08T16:12:09.343Z : Action PCA Confidence computing, steps: 1000 of 1000, progress: 100.0% -- ended
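
One possible filter for the repeated messages is to remember the last text emitted per action and drop a line when it repeats it. The Python sketch below assumes the "timestamp : message" layout shown in the log above and treats the text before the first comma as the action name; `suppress_repeats` is a hypothetical helper, not part of RDFRules.

```python
def suppress_repeats(lines):
    """Drop log lines that repeat the previous message of the same action."""
    last_by_action = {}
    kept = []
    for line in lines:
        # Split "2020-09-08T16:11:28.606Z : Action ..." into its two parts.
        _, _, message = line.partition(" : ")
        action = message.split(",", 1)[0]
        if last_by_action.get(action) == message:
            continue  # same action, same text: nothing new to report
        last_by_action[action] = message
        kept.append(line)
    return kept
```

Timestamps are ignored in the comparison, so two identical messages a few milliseconds apart collapse into one.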

Empty cache file can cause pipeline to fail without log notices

This behaviour can be reproduced as follows:

  • load attached task.json.
  • create empty file "rulesPCA" in the workspace
  • run the pipeline

What will happen:
Indexing will not start. The log messages shown are:

2020-09-11 14:55:25:461 +0200 [rdfrules-http-akka.actor.default-dispatcher-6] INFO com.github.propi.rdfrules.http.InMemoryCache - Some value with key '025495e4-8e84-4f30-bfec-325a18dd3499x' was pushed into the memory cache. Number of items in the cache is: 1
2020-09-11 14:56:16:359 +0200 [Thread-1] INFO task-7216ee1d-9a9b-4286-bad1-0425e3c6b6fc - Predicates trimming.
2020-09-11 14:56:16:363 +0200 [rdfrules-http-akka.actor.default-dispatcher-9] INFO akka.actor.LocalActorRef - Message [com.github.propi.rdfrules.http.service.Task$TaskRequest$AddMsg] to Actor[akka://rdfrules-http/user/task-service/task-7216ee1d-9a9b-4286-bad1-0425e3c6b6fc#-1209497236] was not delivered. [1] dead letters encountered. If this is not an expected behavior then Actor[akka://rdfrules-http/user/task-service/task-7216ee1d-9a9b-4286-bad1-0425e3c6b6fc#-1209497236] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

A workaround is either to delete the empty cache file "rulesPCA" or to set the last Cache node in the pipeline to "revalidate" the cache.
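
A guard along these lines would turn the silent failure into a logged cache miss. It is a Python sketch under assumptions: `usable_cache_file` and `load_or_recompute` are hypothetical helpers, and `load` and `recompute` stand in for reading the cache and re-running the pipeline.

```python
import os


def usable_cache_file(path):
    """True only if the file exists and is not empty."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


def load_or_recompute(path, load, recompute):
    """Use the cache file when it has content; otherwise say so and rebuild."""
    if usable_cache_file(path):
        return load(path)
    print(f"Cache file '{path}' is missing or empty; recomputing.")
    return recompute()
```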

task.json.zip

Comparison of the current method with a new proposal

  • we need not specify the dangling variable
  • first count support for all dangling atoms without instances
  • if the support is greater than the threshold, count instances for the danglings
  • instead of specifyAtom, create a function specifyVariable
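
The two-phase counting in the proposal can be illustrated loosely as follows. This Python sketch is an interpretation, not the RDFRules algorithm: candidate atoms are simplified to (predicate, constant) pairs, and `frequent_instantiations` is a hypothetical name.

```python
from collections import Counter


def frequent_instantiations(bindings, min_support):
    """Count constants only for predicates whose total support is high enough.

    bindings: iterable of (predicate, constant) pairs, one per matching triple
    """
    bindings = list(bindings)
    # Phase 1: support of each atom with the position left as a variable.
    support_by_predicate = Counter(predicate for predicate, _ in bindings)
    frequent = {predicate
                for predicate, support in support_by_predicate.items()
                if support > min_support}
    # Phase 2: count instances only for atoms that passed phase 1.
    return Counter((predicate, constant)
                   for predicate, constant in bindings
                   if predicate in frequent)
```

The point of the first phase is that an instantiated atom can never have more support than its uninstantiated form, so predicates that fail the threshold need no per-constant counting.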

Memory estimation by dataset size and create limits

Estimate the memory needed to store a dataset in the index. Set limits, e.g., max 1 GB = N quads.
There can be an upper limit in combination with System.gc. Once we get close to the limit, we stop loading the index.

Other restrictions in setting:

  • max quads
  • max triple item size
  • max memory during indexing (or at all)
  • max mined rules
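
The "max 1 GB = N quads" idea amounts to dividing a memory budget by an estimated per-quad cost and refusing to load beyond that. The Python sketch below uses hypothetical names, and the bytes-per-quad figure is an input that would have to be measured; no real figure is implied.

```python
class QuadLimitExceeded(Exception):
    """Raised when loading would exceed the configured number of quads."""


def max_quads_for_budget(memory_budget_bytes, bytes_per_quad):
    """Translate a memory budget into a quad limit (integer division)."""
    return memory_budget_bytes // bytes_per_quad


def load_with_limit(quads, max_quads):
    """Collect quads, stopping with an error as soon as the limit is passed."""
    loaded = []
    for quad in quads:
        if len(loaded) >= max_quads:
            raise QuadLimitExceeded(f"more than {max_quads} quads")
        loaded.append(quad)
    return loaded
```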

Slow actor debugger

LinkedBlockingQueue is the bottleneck. Try to implement a "non-blocking" debugger: one message with a counter instead of a queue. The reporting thread can simply sleep for 5 seconds and then read the current message.
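
The "one message with a counter" idea can be sketched like this. It is a Python illustration with hypothetical names, not the RDFRules debugger; a very short critical section stands in for whatever atomic primitive the real implementation would use.

```python
import threading


class LatestMessageDebugger:
    """Keeps only a step counter and the most recent message, no queue."""

    def __init__(self):
        self._lock = threading.Lock()
        self._steps = 0
        self._message = None

    def record(self, message):
        """Called by workers; overwrites the previous message."""
        with self._lock:
            self._steps += 1
            self._message = message

    def snapshot(self):
        """Return (number of recorded steps, latest message)."""
        with self._lock:
            return self._steps, self._message


def report_periodically(debugger, stop, interval_seconds=5.0, emit=print):
    """Wake up every interval and emit the current state until stopped."""
    while not stop.wait(interval_seconds):
        steps, message = debugger.snapshot()
        emit(f"steps: {steps} -- {message}")
```

Workers never wait for the reporter to drain anything, so the cost per step stays constant no matter how slowly messages are printed. Intermediate messages are lost by design.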

Mine will return maximum 10,000 rules

It seems that the output of Mine is capped at 10,000 rules (if Top-K is not used). If Top-K is used, any higher value seems to be automatically reset to 10,000.

Better mining debugging

Separate debugging to stages and offer progress bar based on the queue size for each stage.

add json task command line support

It would be convenient to be able to run RDFRules, e.g., as

java -jar RDFRulesLauncher.jar "task.json"
where task.json would be generated in the GUI, or modified from a task.json generated in the GUI.
This could supersede the Java API.

sync with new SBT version

It seems that SBT is not compatible with JDK 13 and 14.
sbt/sbt#5509 ("We don't test sbt on JDK 14, so that could also be the problem. Please run it on JDK 8 or 11.")
If this is true, the documentation should warn about this.
For me, it works with JDK 11.

Also, the run-main command on the RDFRules homepage does not seem to work with the current version of SBT (see clulab/eidos#440).
It seems it was replaced by runMain.

Not to involve pruned head triples in the refining phase

Once some head triples are not mapped to the body (they are pruned), we need not involve them in the next refine phase.

If the A_r set is empty, the current binding of the head s,p,o can be omitted within any other refinements of subsequent rules that are based on the current rule.

When RDFRules runs out of memory, worker threads are not terminated

When RDFRules runs out of memory (GC overhead limit exceeded), worker threads are not terminated and the load of all CPU cores remains at 100%.

020-09-16 11:34:59:780 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16664 (0.06 per sec) -- processed rules, found closed rules: 25535936, queue size: 25577603, stage: 2, activeThreads: 6
Exception in thread "Thread-40" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.github.propi.rdfrules.algorithm.amie.RuleFilter$RulePatternFilter$$Lambda$1452/1315182476.get$Lambda(Unknown Source)
    at java.lang.invoke.LambdaForm$DMH/1023714065.invokeStatic_LL_L(LambdaForm$DMH)
    at java.lang.invoke.LambdaForm$MH/1802598046.linkToTargetMethod(LambdaForm$MH)
    at com.github.propi.rdfrules.algorithm.amie.RuleFilter$RulePatternFilter.matchAtom(RuleFilter.scala:83)
    at com.github.propi.rdfrules.algorithm.amie.RuleFilter$RulePatternFilter.apply(RuleFilter.scala:95)
    at com.github.propi.rdfrules.algorithm.amie.RuleFilter$And.apply(RuleFilter.scala:42)
    at com.github.propi.rdfrules.algorithm.amie.RuleRefinement.$anonfun$refine$13(RuleRefinement.scala:203)
    at com.github.propi.rdfrules.algorithm.amie.RuleRefinement$$Lambda$1592/1828223227.apply(Unknown Source)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:448)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:501)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:447)
    at scala.collection.Iterator.foreach(Iterator.scala:929)
    at scala.collection.Iterator.foreach$(Iterator.scala:929)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
    at com.github.propi.rdfrules.algorithm.amie.Amie$AmieProcess$$anon$1.$anonfun$run$11(Amie.scala:209)
    at com.github.propi.rdfrules.algorithm.amie.Amie$AmieProcess$$anon$1.$anonfun$run$11$adapted(Amie.scala:202)
    at com.github.propi.rdfrules.algorithm.amie.Amie$AmieProcess$$anon$1$$Lambda$1422/153380730.apply(Unknown Source)
    at scala.collection.Iterator.foreach(Iterator.scala:929)
    at scala.collection.Iterator.foreach$(Iterator.scala:929)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
    at com.github.propi.rdfrules.algorithm.amie.Amie$AmieProcess$$anon$1.run(Amie.scala:202)
    at java.lang.Thread.run(Thread.java:748)
2020-09-16 11:35:34:200 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16665 (0.04 per sec) -- processed rules, found closed rules: 25538702, queue size: 25580374, stage: 2, activeThreads: 6
2020-09-16 11:36:24:157 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16666 (0.06 per sec) -- processed rules, found closed rules: 25544096, queue size: 25585771, stage: 2, activeThreads: 6
Exception in thread "Thread-44" java.lang.OutOfMemoryError: GC overhead limit exceeded
2020-09-16 11:37:33:113 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16667 (0.06 per sec) -- processed rules, found closed rules: 25546573, queue size: 25588249, stage: 2, activeThreads: 6
Exception in thread "Thread-43" java.lang.OutOfMemoryError: GC overhead limit exceeded
2020-09-16 11:38:00:114 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16668 (0.06 per sec) -- processed rules, found closed rules: 25547331, queue size: 25589007, stage: 2, activeThreads: 6
Exception in thread "Thread-41" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-45" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-33" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-53" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-68" java.lang.OutOfMemoryError: GC overhead limit exceeded
Uncaught error from thread [rdfrules-http-scheduler-1]: GC overhead limit ex
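
A common remedy for this kind of problem is a shared stop signal: the first worker that fails sets it, and the remaining workers check it in their loops and exit. The sketch below shows the pattern in Python as an analogy only. It is not RDFRules code, `run_workers` is a hypothetical helper, and a JVM implementation would use its own mechanisms for signalling and error handling.

```python
import threading


def run_workers(tasks):
    """Run tasks in threads; stop all of them once any task fails.

    tasks: callables that accept a threading.Event and are expected to
           return soon after the event is set.
    Returns the list of errors raised by the tasks.
    """
    stop = threading.Event()
    errors = []

    def guarded(task):
        try:
            task(stop)
        except Exception as error:  # MemoryError is an Exception subclass
            errors.append(error)
            stop.set()              # tell the other workers to finish

    threads = [threading.Thread(target=guarded, args=(task,))
               for task in tasks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors
```

The pattern only works if every worker checks the signal regularly, for example once per processed rule.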

Add graph-based atoms/rules and constraints

p(a, b, Dbpedia) -> p(a, b, Yago)
p(a, b, [Dbpedia, Wikidata]) -> p(a, b, Yago)

  • add a constraint which enables this behaviour. By default, graph-based rules are not used.
  • the rule pattern for graphs works only if the graph-based mode is turned on.
  • when printing a rule, add a parameter for showing graphs in rules

Thresholds deleting

Remove thresholds which are not used during mining (confidence, pcaconfidence, etc.).

Constraints enhancements

By default, mining without constraints should mean mining with constants at the subject and object positions. Constraints should be:

  • constants at the subject position
  • constants at the object position
  • constants at the functional item, e.g. (C hasCitizen ?a) or (?a isCitizenOf C): we instantiate the object for functions and the subject for inverse functions, because these items should have greater support.
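
The functional-item option could be decided per predicate by comparing its functionality with its inverse functionality. The Python sketch below assumes the usual AMIE-style definitions (distinct subjects, or distinct objects, divided by the number of subject-object pairs); the function names are hypothetical and the entity names in the usage note are made up.

```python
def functionality(pairs):
    """Distinct subjects divided by distinct (subject, object) pairs."""
    pairs = set(pairs)
    if not pairs:
        return 0.0
    return len({subject for subject, _ in pairs}) / len(pairs)


def inverse_functionality(pairs):
    """Distinct objects divided by distinct (subject, object) pairs."""
    pairs = set(pairs)
    if not pairs:
        return 0.0
    return len({obj for _, obj in pairs}) / len(pairs)


def position_to_instantiate(pairs):
    """Pick the position whose constants should have the greater support.

    For a (more) functional predicate the object is instantiated,
    for a (more) inverse functional predicate the subject.
    """
    if functionality(pairs) >= inverse_functionality(pairs):
        return "object"
    return "subject"
```

For isCitizenOf, where each person has one country and a country has many citizens, this selects the object, matching (?a isCitizenOf C). For hasCitizen it selects the subject, matching (C hasCitizen ?a).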
  • without constants
