propi / rdfrules
RDFRules: Analytical Tool for Rule Mining from RDF Knowledge Graphs
License: GNU General Public License v3.0
If JSON-serialized rules contain confidence or some other measure, they cannot be deserialized via Load ruleset due to a deserialization exception: Invalid type of measure.
rules.json
[
  {
    "body": [
      {
        "object": { "type": "variable", "value": "?a" },
        "predicate": "<interacts_with>",
        "subject": { "type": "variable", "value": "?b" }
      }
    ],
    "head": {
      "object": { "type": "variable", "value": "?b" },
      "predicate": "<interacts_with>",
      "subject": { "type": "variable", "value": "?a" }
    },
    "measures": [
      { "name": "BodySize", "value": 11702212 },
      { "name": "HeadCoverage", "value": 0.9917442958647477 },
      { "name": "Support", "value": 11605602 },
      { "name": "HeadSize", "value": 11702212 },
      { "name": "Confidence", "value": 0.9917442958647477 }
    ]
  }
]
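A possible workaround until the deserializer is fixed: since the error message points at the measures entries, stripping (or emptying) the measures array before Load ruleset may let the file load. This is an untested assumption based only on the error text:

```json
[
  {
    "body": [
      {
        "object": { "type": "variable", "value": "?a" },
        "predicate": "<interacts_with>",
        "subject": { "type": "variable", "value": "?b" }
      }
    ],
    "head": {
      "object": { "type": "variable", "value": "?b" },
      "predicate": "<interacts_with>",
      "subject": { "type": "variable", "value": "?a" }
    },
    "measures": []
  }
]
```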
It would help if there were support for an approximate progress indicator for the Mine task (number of rules processed, plus possibly an estimate based on the time required to process the rules so far).
Some thresholds in the Mine node take effect even when they are not present, because defaults apply.
This affects, e.g., "Min head size", which has a default of 100. The default values in effect should be communicated to the user.
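As a workaround, the default can be made explicit in the task JSON so the value in effect is at least visible. The MinHeadSize threshold name below is assumed by analogy with the TopK and MinHeadCoverage names used in the GUI-generated tasks:

```json
{
  "name": "Mine",
  "parameters": {
    "thresholds": [
      { "name": "MinHeadSize", "value": 100 },
      { "name": "MinHeadCoverage", "value": 0.01 }
    ],
    "patterns": [],
    "constraints": []
  }
}
```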
This is a problem with the default Akka response size limit when there are many log messages.
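If the cut-off comes from Akka HTTP's entity size limit (an assumption; the exact setting depends on where the response is truncated), it can be raised in application.conf:

```hocon
# Default for both is 8m; raise it if large log responses are cut off.
akka.http.server.parsing.max-content-length = 64m
akka.http.client.parsing.max-content-length = 64m
```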
Add hints to parameters and operations
We need to resolve the lifetime (or idle time) of an index in memory, or add a setting: maximum idle time for an index.
It should be possible to attach a schema to a dataset. Then we could perform some extended operations:
If the ruleset is backed by an IndexedSeq collection, it should not re-save the rules into a new collection during caching. Check whether it behaves this way.
Sometimes the results contain different predicates. A task that reproduces the wrong behaviour:
[
{
"name": "LoadGraph",
"parameters": {
"path": "/dbpedia_yago/mappingbased_objects_sample.ttl",
"graphName": "<dbpedia>"
}
},
{
"name": "LoadGraph",
"parameters": {
"path": "/dbpedia_yago/yagoFacts.tsv",
"graphName": "<yago>"
}
},
{
"name": "LoadGraph",
"parameters": {
"path": "/dbpedia_yago/yagoDBpediaInstances.tsv",
"graphName": "<dbpedia>"
}
},
{
"name": "MergeDatasets",
"parameters": {}
},
{
"name": "AddPrefixes",
"parameters": {
"prefixes": [
{
"prefix": "dbo",
"nameSpace": "http://dbpedia.org/ontology/"
},
{
"prefix": "dbr",
"nameSpace": "http://dbpedia.org/resource/"
}
]
}
},
{
"name": "Index",
"parameters": {
"prefixedUris": true
}
},
{
"name": "Mine",
"parameters": {
"thresholds": [
{
"name": "TopK",
"value": 1000
},
{
"name": "MinHeadCoverage",
"value": 0.01
}
],
"patterns": [],
"constraints": [
{
"name": "WithoutConstants"
}
]
}
},
{
"name": "CacheRuleset",
"parameters": {
"inMemory": true,
"path": "e4790ffb-d535-4e14-9478-867d3f4abe2a",
"revalidate": false
}
},
{
"name": "ComputePcaConfidence",
"parameters": {
"min": 0.5,
"topk": 50
}
},
{
"name": "Sorted",
"parameters": {}
},
{
"name": "GraphBasedRules",
"parameters": {}
},
{
"name": "Instantiate",
"parameters": {
"rule": {
"body": [
{
"graphs": [
"<dbpedia>"
],
"object": {
"type": "variable",
"value": "?c"
},
"predicate": {
"localName": "album",
"nameSpace": "http://dbpedia.org/ontology/",
"prefix": "dbo"
},
"subject": {
"type": "variable",
"value": "?a"
}
},
{
"graphs": [
"<yago>"
],
"object": {
"type": "variable",
"value": "?c"
},
"predicate": "<created>",
"subject": {
"type": "variable",
"value": "?b"
}
}
],
"head": {
"graphs": [
"<dbpedia>"
],
"object": {
"type": "variable",
"value": "?b"
},
"predicate": {
"localName": "musicalBand",
"nameSpace": "http://dbpedia.org/ontology/",
"prefix": "dbo"
},
"subject": {
"type": "variable",
"value": "?a"
}
},
"measures": [
{
"name": "HeadCoverage",
"value": 0.4664823773324119
},
{
"name": "HeadSize",
"value": 2894
},
{
"name": "PcaBodySize",
"value": 1368
},
{
"name": "Support",
"value": 1350
},
{
"name": "PcaConfidence",
"value": 0.9868421052631579
}
]
},
"part": "Whole"
}
},
{
"name": "GetRules",
"parameters": {}
}
]
By default, the revalidate checkbox is unchecked; once the cache has been created within the workflow, the next usage is loaded from the cache and all preceding operations are omitted. If revalidate is checked, all previous operations are performed and the cache is created again.
Set immutable and mutable folders, and set temporary restrictions for uploaded files (e.g., max one week). Set a memory limit for the app, and restart the HTTP service if memory overflows. Show the current state of memory in the GUI.
Currently the result contains all predicted triples, including triples which are already present in the dataset.
This is needed so that some results survive if mining crashes, and to save memory.
Add a strategy that avoids outputting rules that do not improve the confidence w.r.t. their parents. These rules should be involved in refinement but not in the output. Maybe this functionality should be added into Ruleset.
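The intended filter could look roughly like this; an illustrative sketch in plain Java, not the RDFRules data model:

```java
/**
 * Sketch of the "improvement over parent" filter: a rule reaches the output
 * only if its confidence strictly improves on its parent's confidence.
 * Such rules would still take part in refinement; they are only withheld
 * from the result set. The Rule class is hypothetical.
 */
final class Rule {
    final double confidence;
    final Rule parent; // null for an initial (head-only) rule

    Rule(double confidence, Rule parent) {
        this.confidence = confidence;
        this.parent = parent;
    }

    /** True if this rule should appear in the output. */
    boolean improvesParent() {
        return parent == null || confidence > parent.confidence;
    }
}
```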
Do not show the same message multiple times. The dataset-loading logging is wrong because it does not take into account multiple graphs being merged into one dataset. Resolve how to disable dataset-loading logging during indexing, since very annoying messages appear that are the same for dataset loading and index loading:
2020-09-08T16:11:28.606Z : Action Dataset indexing, steps: 0 -- started
2020-09-08T16:11:28.612Z : Action Dataset loading, steps: 0 -- started
2020-09-08T16:11:28.612Z : Action Dataset indexing, steps: 0 -- started
2020-09-08T16:11:30.862Z : Action Dataset loading, steps: 14465 -- ended
2020-09-08T16:11:30.862Z : Action Dataset indexing, steps: 14465
2020-09-08T16:11:30.862Z : Action Dataset loading, steps: 0 -- started
2020-09-08T16:11:30.862Z : Action Dataset indexing, steps: 14465
2020-09-08T16:11:32.434Z : Action Dataset loading, steps: 18845 -- ended
2020-09-08T16:11:32.434Z : Action Dataset indexing, steps: 33310
2020-09-08T16:11:32.435Z : Action Dataset loading, steps: 0 -- started
2020-09-08T16:11:32.435Z : Action Dataset indexing, steps: 33310
2020-09-08T16:11:37.436Z : Action Dataset loading, steps: 20205
2020-09-08T16:11:37.436Z : Action Dataset indexing, steps: 53516
2020-09-08T16:11:42.437Z : Action Dataset loading, steps: 52228
2020-09-08T16:11:42.437Z : Action Dataset indexing, steps: 85539
2020-09-08T16:11:47.438Z : Action Dataset loading, steps: 81463
2020-09-08T16:11:47.438Z : Action Dataset indexing, steps: 114774
2020-09-08T16:11:52.440Z : Action Dataset loading, steps: 112571
2020-09-08T16:11:52.443Z : Action Dataset indexing, steps: 145882
2020-09-08T16:11:53.711Z : Action Dataset loading, steps: 121437 -- ended
2020-09-08T16:11:53.712Z : Action Dataset indexing, steps: 154747
2020-09-08T16:11:53.765Z : Action Dataset indexing, steps: 154747 -- ended
2020-09-08T16:11:53.766Z : Action SameAs resolving, steps: 0 -- started
2020-09-08T16:11:54.195Z : Predicates trimming.
2020-09-08T16:11:54.195Z : Action SameAs resolving, steps: 0 -- ended
2020-09-08T16:11:54.318Z : Action Subjects indexing, steps: 0 -- started
2020-09-08T16:11:54.878Z : Subjects trimming.
2020-09-08T16:11:54.878Z : Action Subjects indexing, steps: 140281 -- ended
2020-09-08T16:11:54.948Z : Action Objects indexing, steps: 0 -- started
2020-09-08T16:11:55.341Z : Objects trimming.
2020-09-08T16:11:55.341Z : Action Objects indexing, steps: 140281 -- ended
2020-09-08T16:11:55.407Z : Action Amie rules mining, steps: 0 -- started
2020-09-08T16:11:55.407Z : Action Browsed projections large buckets, steps: 0 -- started
2020-09-08T16:11:55.407Z : Action Amie rules mining, steps: 0 -- started
2020-09-08T16:12:00.423Z : Action Browsed projections large buckets, steps: 0 -- started
2020-09-08T16:12:00.425Z : Action Amie rules mining, steps: 3500 -- processed rules, found closed rules: 1086, queue size: 7609
2020-09-08T16:12:05.436Z : Action Browsed projections large buckets, steps: 0 -- started
2020-09-08T16:12:05.436Z : Action Amie rules mining, steps: 9053 -- processed rules, found closed rules: 2032, queue size: 1631
2020-09-08T16:12:08.207Z : Action Browsed projections large buckets, steps: 0 -- ended
2020-09-08T16:12:08.207Z : Action Amie rules mining, steps: 10206 -- processed rules, found closed rules: 2242, queue size: 0
2020-09-08T16:12:08.208Z : Action Amie rules mining, steps: 10206 -- ended
2020-09-08T16:12:08.261Z : Action PCA Confidence computing, steps: 0 of 1000, progress: 0.0% -- started
2020-09-08T16:12:09.343Z : Action PCA Confidence computing, steps: 1000 of 1000, progress: 100.0% -- ended
For example: CacheDataset is both an action and a transformation. During loading it is impossible to distinguish an action from a transformation with the same name.
This behaviour can be reproduced.
What will happen: indexing will not start. The log messages shown are:
2020-09-11 14:55:25:461 +0200 [rdfrules-http-akka.actor.default-dispatcher-6] INFO com.github.propi.rdfrules.http.InMemoryCache - Some value with key '025495e4-8e84-4f30-bfec-325a18dd3499x' was pushed into the memory cache. Number of items in the cache is: 1
2020-09-11 14:56:16:359 +0200 [Thread-1] INFO task-7216ee1d-9a9b-4286-bad1-0425e3c6b6fc - Predicates trimming.
2020-09-11 14:56:16:363 +0200 [rdfrules-http-akka.actor.default-dispatcher-9] INFO akka.actor.LocalActorRef - Message [com.github.propi.rdfrules.http.service.Task$TaskRequest$AddMsg] to Actor[akka://rdfrules-http/user/task-service/task-7216ee1d-9a9b-4286-bad1-0425e3c6b6fc#-1209497236] was not delivered. [1] dead letters encountered. If this is not an expected behavior then Actor[akka://rdfrules-http/user/task-service/task-7216ee1d-9a9b-4286-bad1-0425e3c6b6fc#-1209497236] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
A workaround is either to delete the empty cache file "rulesPCA" or to set the last Cache node in the pipeline to "revalidate" the cache.
In the GUI, if the workspace is empty, a JavaScript error is returned.
useMapper[T](mapper => index => T)
Estimate the memory needed for storing a dataset in the index. Set limits, e.g., max 1 GB = N quads.
There can be an upper limit in combination with System.gc: once we get close to the limit, we stop loading the index.
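A minimal sketch of such a safeguard, assuming a plain heap-usage check against a configurable fraction of the maximum heap (not existing RDFRules code):

```java
/**
 * Sketch: while quads are being loaded into the in-memory index, check heap
 * usage against an upper limit. Near the limit, request a GC as a last
 * attempt to reclaim memory; if usage is still too high, loading should stop.
 */
final class HeapGuard {
    private final double limitFraction; // e.g. 0.9 = stop at 90% of max heap

    HeapGuard(double limitFraction) {
        this.limitFraction = limitFraction;
    }

    /** True if loading should stop because the heap is nearly exhausted. */
    boolean nearLimit() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        if (used > limitFraction * rt.maxMemory()) {
            System.gc(); // best-effort reclaim before giving up
            used = rt.totalMemory() - rt.freeMemory();
            return used > limitFraction * rt.maxMemory();
        }
        return false;
    }
}
```

The loading loop would call nearLimit() periodically (e.g. every N quads) rather than per quad, since totalMemory/freeMemory are not free.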
Other restrictions in setting:
LinkedBlockingQueue is the bottleneck. Try to implement a "non-blocking" debugger: one message with a counter instead of a queue. The logging thread can simply sleep for 5 seconds and then read the current message.
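The idea could be sketched as follows (illustrative, not the current implementation): workers update shared state without ever blocking, and the logging thread samples that state every ~5 seconds instead of draining a queue.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of a non-blocking progress reporter. Instead of pushing every log
 * message through a LinkedBlockingQueue, worker threads bump a counter and
 * overwrite a single volatile message. A logging thread can periodically
 * (e.g. Thread.sleep(5000)) read steps() and currentMessage().
 */
final class ProgressReporter {
    private final AtomicLong counter = new AtomicLong();
    private volatile String lastMessage = "";

    /** Called by worker threads: O(1), lock-free, never blocks. */
    void step(String message) {
        counter.incrementAndGet();
        lastMessage = message;
    }

    long steps() { return counter.get(); }

    String currentMessage() { return lastMessage; }
}
```

Intermediate messages are deliberately dropped; only the latest state is observable, which is exactly what a periodic progress log needs.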
It seems that the output of Mine is capped at 10,000 rules (if Top-K is not used). If Top-K is used, any value higher than 10,000 seems to be automatically reduced to 10,000.
It is apparently not possible to search for a rule based on matches in two or more atoms.
Atoms in rules are separated by spaces, but searching for a space works only inside atoms.
The instantiation of "body" and "head" seems to return the same result as the instantiation of the whole rule.
Separate debugging into stages and offer a progress bar based on the queue size for each stage.
It would be convenient to be able to run RDFRules, e.g., as
java -jar RDFRulesLauncher.jar "task.json"
where task.json would be generated in the GUI, or modified based on a task.json generated in the GUI.
This could supersede the Java API.
It seems that SBT is not compatible with JDK 13 and 14.
sbt/sbt#5509 ("We don't test sbt on JDK 14, so that could also be the problem. Please run it on JDK 8 or 11.")
If this is true, the documentation should warn about this.
For me, it works with JDK 11.
Also, the run-main command on the RDFRules homepage does not seem to work with the current version of SBT (clulab/eidos#440). It seems it was replaced by runMain.
Once some head triples are not mapped to the body (they are pruned), we need not involve them in the next refinement phase.
If the A_r set is empty, the current binding of the head s, p, o can be omitted in any further refinements of subsequent rules that have the current rule as their basis.
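The pruning could be sketched like this, with hypothetical names rather than RDFRules internals: remember head bindings whose A_r set came out empty for the current rule, and consult that set when refining the rule's descendants.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Sketch of the proposed pruning. For a given rule basis, a head binding
 * (s, p, o) with an empty body-binding set A_r can never contribute support
 * to any refinement of that rule, so it is recorded and skipped later.
 * All types and names here are illustrative.
 */
final class HeadPruning {
    // head triples already known to have an empty A_r for this rule's basis
    private final Set<List<String>> emptyHeads = new HashSet<>();

    /** Record a head binding whose A_r set turned out to be empty. */
    void markEmpty(String s, String p, String o) {
        emptyHeads.add(List.of(s, p, o));
    }

    /** True if this head binding can be skipped in further refinements. */
    boolean shouldSkip(String s, String p, String o) {
        return emptyHeads.contains(List.of(s, p, o));
    }
}
```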
When RDFRules runs out of memory (GC overhead limit exceeded), worker threads are not terminated and the load of all CPU cores remains at 100%.
2020-09-16 11:34:59:780 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16664 (0.06 per sec) -- processed rules, found closed rules: 25535936, queue size: 25577603, stage: 2, activeThreads: 6
Exception in thread "Thread-40" java.lang.OutOfMemoryError: GC overhead limit exceeded
	at com.github.propi.rdfrules.algorithm.amie.RuleFilter$RulePatternFilter$$Lambda$1452/1315182476.get$Lambda(Unknown Source)
	at java.lang.invoke.LambdaForm$DMH/1023714065.invokeStatic_LL_L(LambdaForm$DMH)
	at java.lang.invoke.LambdaForm$MH/1802598046.linkToTargetMethod(LambdaForm$MH)
	at com.github.propi.rdfrules.algorithm.amie.RuleFilter$RulePatternFilter.matchAtom(RuleFilter.scala:83)
	at com.github.propi.rdfrules.algorithm.amie.RuleFilter$RulePatternFilter.apply(RuleFilter.scala:95)
	at com.github.propi.rdfrules.algorithm.amie.RuleFilter$And.apply(RuleFilter.scala:42)
	at com.github.propi.rdfrules.algorithm.amie.RuleRefinement.$anonfun$refine$13(RuleRefinement.scala:203)
	at com.github.propi.rdfrules.algorithm.amie.RuleRefinement$$Lambda$1592/1828223227.apply(Unknown Source)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:448)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:501)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:447)
	at scala.collection.Iterator.foreach(Iterator.scala:929)
	at scala.collection.Iterator.foreach$(Iterator.scala:929)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
	at com.github.propi.rdfrules.algorithm.amie.Amie$AmieProcess$$anon$1.$anonfun$run$11(Amie.scala:209)
	at com.github.propi.rdfrules.algorithm.amie.Amie$AmieProcess$$anon$1.$anonfun$run$11$adapted(Amie.scala:202)
	at com.github.propi.rdfrules.algorithm.amie.Amie$AmieProcess$$anon$1$$Lambda$1422/153380730.apply(Unknown Source)
	at scala.collection.Iterator.foreach(Iterator.scala:929)
	at scala.collection.Iterator.foreach$(Iterator.scala:929)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
	at com.github.propi.rdfrules.algorithm.amie.Amie$AmieProcess$$anon$1.run(Amie.scala:202)
	at java.lang.Thread.run(Thread.java:748)
2020-09-16 11:35:34:200 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16665 (0.04 per sec) -- processed rules, found closed rules: 25538702, queue size: 25580374, stage: 2, activeThreads: 6
2020-09-16 11:36:24:157 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16666 (0.06 per sec) -- processed rules, found closed rules: 25544096, queue size: 25585771, stage: 2, activeThreads: 6
Exception in thread "Thread-44" java.lang.OutOfMemoryError: GC overhead limit exceeded
2020-09-16 11:37:33:113 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16667 (0.06 per sec) -- processed rules, found closed rules: 25546573, queue size: 25588249, stage: 2, activeThreads: 6
Exception in thread "Thread-43" java.lang.OutOfMemoryError: GC overhead limit exceeded
2020-09-16 11:38:00:114 +0200 [Thread-33] INFO task-3ba84b09-de58-4f40-bb60-3d41f2e4062a - Action Amie rules mining, steps: 16668 (0.06 per sec) -- processed rules, found closed rules: 25547331, queue size: 25589007, stage: 2, activeThreads: 6
Exception in thread "Thread-41" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-45" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-33" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-53" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-68" java.lang.OutOfMemoryError: GC overhead limit exceeded
Uncaught error from thread [rdfrules-http-scheduler-1]: GC overhead limit ex
p(a, b, Dbpedia) -> p(a, b, Yago)
p(a, b, [Dbpedia, Wikidata]) -> p(a, b, Yago)
Remove thresholds which are not used during mining (confidence, PCA confidence, etc.).
By default, mining without constraints should mean mining with constants at the subject and object positions. Constraints should be:
In HTTP module