derwenai / kglab

Graph Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, NetworkX, RAPIDS, RDFlib, pySHACL, PyVis, morph-kgc, pslpython, pyarrow, etc.

Home Page: https://derwen.ai/docs/kgl/

License: MIT License

Languages: Jupyter Notebook 59.57%, HTML 29.44%, Python 10.74%, Dockerfile 0.19%, Shell 0.03%, Ruby 0.02%, Makefile 0.01%
Topics: knowledge-graph, rdflib, networkx, skos, parquet, graph-algorithms, pyvis, json-ld, sparql, shacl

kglab's Introduction

kglab


Welcome to Graph Data Science: https://derwen.ai/docs/kgl/

The kglab library provides a simple abstraction layer in Python 3.7+ for building knowledge graphs, leveraging Pandas, NetworkX, RAPIDS, RDFLib, Morph-KGC, pslpython, and many more.

SPECIAL REQUEST:
Which features would you like in an open source Python library for building knowledge graphs?
Please add your suggestions through this survey:
https://forms.gle/FMHgtmxHYWocprMn6
This will help us prioritize the kglab roadmap.

Reviews

@kaaloo:

"Feels like it's a Hugging Face for graphs! 🤯"

Getting Started

See the "Getting Started" section of the online documentation.

Using kglab as a library for your Python project

We recommend installing from PyPI or conda:

pip

python3 -m pip install kglab

pipenv

pipenv install kglab

conda

conda create -n kglab python=3.7
conda activate kglab
pip install kglab

Or, install from source:

If you work directly from this Git repo, be sure to install the dependencies:

pip

python3 -m pip install -U pip wheel
python3 -m pip install -r requirements.txt

pipenv

pipenv install --dev

Alternatively, to install dependencies using conda:

conda env create -f environment.yml --force
conda activate kglab

Sample Code

Then try some simple uses of this library:

import kglab

# create a KnowledgeGraph object
kg = kglab.KnowledgeGraph()

# load RDF from a URL
kg.load_rdf("http://bigasterisk.com/foaf.rdf", format="xml")

# measure the graph
measure = kglab.Measure()
measure.measure_graph(kg)

print("edges: {}\n".format(measure.get_edge_count()))
print("nodes: {}\n".format(measure.get_node_count()))

# serialize as a string in "Turtle" TTL format
ttl = kg.save_rdf_text()
print(ttl)

See the tutorial notebooks in the examples subdirectory for sample code and patterns to use in integrating kglab with other graph libraries in Python: https://derwen.ai/docs/kgl/tutorial/

WARNING when installing in an existing environment:
Installing a new package in an existing environment may reveal
or create version conflicts. See the kglab requirements
in requirements.txt before you do. For example, there is a
known version conflict between NumPy (>= 1.19.4) and TensorFlow 2+ (which requires numpy ~= 1.19.2).

Using Docker

For a simple approach to running the tutorials, see use of docker compose: https://derwen.ai/docs/kgl/tutorial/#use-docker-compose

Also, container images for each release are available on DockerHub: https://hub.docker.com/repository/docker/derwenai/kglab

To build a container image and run it for the tutorials:

docker build --pull --rm -f "docker/Dockerfile" -t kglab:latest .
docker run -p 8888:8888 -it kglab

To build and run a container image for testing:

docker build --pull --rm -f "docker/testsuite.Dockerfile" -t kglabtest:latest .
docker run --rm -it kglabtest
Build Instructions

Note: unless you are contributing code and updates, in most use cases you won't need to build this package locally.

Instead, simply install from PyPI or use Conda.

To set up the build environment locally, see the "Build Instructions" section of the online documentation.

Semantic Versioning

Before kglab reaches release v1.0.0, the types and classes may undergo substantial changes, and the project is not guaranteed to have a consistent API.

Even so, we'll try to minimize breaking changes. We'll also be sure to provide careful notes.

See: changelog.txt

Contributing Code

We welcome people getting involved as contributors to this open source project!

For detailed instructions please see: CONTRIBUTING.md

License and Copyright

Source code for kglab, plus its logo, documentation, and examples, has an MIT license, which is succinct and simplifies use in commercial applications.

All materials herein are Copyright © 2020-2023 Derwen, Inc.

Attribution

Please use the following BibTeX entry for citing kglab if you use it in your research or software. Citations are helpful for the continued development and maintenance of this library.
@software{kglab,
  author = {Paco Nathan},
  title = {{kglab: a simple abstraction layer in Python for building knowledge graphs}},
  year = 2020,
  publisher = {Derwen},
  doi = {10.5281/zenodo.6360664},
  url = {https://github.com/DerwenAI/kglab}
}

illustration of a knowledge graph, plus laboratory glassware

Kudos

Many thanks to our open source sponsors and contributors: @ceteri, @dvsrepo, @Ankush-Chander, @louisguitton, @tomaarsen, @Mec-iS, @jake-aft, @Tpt, @ArenasGuerreroJulian, @fils, @cutterkom, @RishiKumarRay, @gauravjaglan, @pebbie, @CatChenal, @jorisSchaller, @dmoore247; plus general support from Derwen, Inc.; the Knowledge Graph Conference and Connected Data World, along with the even larger scope of use cases represented by their communities; Kubuntu Focus; the RAPIDS team @ NVIDIA; Gradient Flow; and Manning Publications.


kglab's Issues

return all namespaces

Note that get_ns_dict() only returns the prefix:uri pairs that have been added explicitly to a kglab.KnowledgeGraph object instance.

It should be returning the union of the namespaces referenced by each of the parses, loads, and explicit namespace additions.

Here's example code showing the value the method should return:

import kglab

kg = kglab.KnowledgeGraph()
kg.load_rdf("dat/wtm.ttl", format="ttl")

nm = kg.rdf_graph().namespace_manager

ns = {
    prefix: uri
    for prefix, uri in nm.namespaces()
}

# the fixed get_ns_dict() should return this dict

unsatisfied requirement grave

Documenting here; this was covered in the lab this week:

The package grave is imported by kglab.py but not specified in requirements.txt.
Without pip install grave I received an error (screenshot omitted here).

@ceteri stated the intent was to remove grave.

SPARQL Query Visualization

Hi, Thank you for this nice work!

I am suggesting adding the ability to visualize a SPARQL query (its graph pattern) to help in the tutorials.
I wrote this small library a year ago.
I noticed that kglab uses pyvis, so I refactored the code a bit today so it can be used with pyvis.
This is available in the example notebook.

Cheers

NVIDIA RAPIDS integration

Provide integrations for NVIDIA RAPIDS to accelerate graph-based methods within kglab.

Depends on: #37

  • determine how to switch library dependencies and usage internally within kglab based on GPU availability (see the sketch after this list)
  • cuDF to accelerate usage for pandas, PyArrow, etc.
  • cuGraph to accelerate usage for networkx, igraph, etc.
  • cuML to accelerate usage for UMAP, PyTorch, scikit-learn, etc.
  • cuFilter to accelerate graph visualization methods (add Bokeh ?)
  • ask @kingmesal, et al., for suggestions and help in promotion
  • work with @BlazingDB to develop and test the RAPIDS integration
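
A minimal sketch of the first item above (an assumed pattern, not kglab's actual implementation): switch between cuDF and pandas at import time, depending on whether RAPIDS is available:

# hypothetical GPU/CPU switch: prefer RAPIDS cuDF, fall back to pandas
try:
    import cudf as xdf   # GPU-accelerated DataFrames (RAPIDS)
    USE_GPUS = True
except ImportError:
    import pandas as xdf # CPU fallback with a near-identical API
    USE_GPUS = False

# downstream code can use the same DataFrame calls either way
df = xdf.DataFrame({"src": [0, 1], "dst": [1, 2]})
print(type(df), USE_GPUS)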

Online documentation

Leverage MkDocs and mknotebooks to generate a documentation site in Markdown suitable for interoperating with GitHub Pages

  1. write the main pages in Markdown to describe "How" and "Why" to use kglab for common KG use cases
  2. draw from material in the notebooks to illustrate computable content
  3. parse the generated content to build a sub-site delivered in Flask on https://derwen.ai/

Implement `Subgraph` class

Whenever networkx gets used in kglab, one required step in data preparation is to assign an integer node_id to each node used in the graph algorithm. Often these are subgraphs of the KnowledgeGraph object, not the complete RDF graph, due to filtering based on SPARQL queries, graph traversals, etc.

Also, the subgraph needed by networkx must be simplified: only one edge between each pair of nodes.
We must choose which relations get used to represent edges in our subgraph, and sometimes we need to filter out particular predicates. For example, including rdf:type in a centrality calculation may skew its results.

Other library integrations have similar needs, including pyvis visualizations, pslpython inference, and, looking ahead, many of the libraries for knowledge graph embedding.

Consequently, let's introduce another class called Subgraph which manages translating from an RDF graph to a simplified directed graph that uses numeric identifiers. This will help to package results from networkx and other libraries, such as automatically translating node_id values back to node objects; see the sketch after the feature list below.

Features include:

  • filter predicates to construct a subgraph
  • build a map of unique identifiers (integers) to nodes within the subgraph
  • map node_id to node and vice versa (codec)
  • construct a DiGraph object
  • package results into iterators returning namedtuple
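
A minimal sketch of the proposed codec idea, with illustrative names (NodeCodec is not part of kglab): assign each RDF node an integer node_id, run a networkx algorithm, then translate the ids back to node objects:

import networkx as nx

class NodeCodec:
    """bidirectional map between RDF nodes and integer node_id values"""
    def __init__(self):
        self._node_to_id = {}
        self._id_to_node = []

    def transform(self, node):
        # assign a stable integer id on first sight
        if node not in self._node_to_id:
            self._node_to_id[node] = len(self._id_to_node)
            self._id_to_node.append(node)
        return self._node_to_id[node]

    def inverse(self, node_id):
        # translate an integer id back to its RDF node
        return self._id_to_node[node_id]

codec = NodeCodec()
nx_graph = nx.DiGraph()
nx_graph.add_edge(codec.transform("ex:alice"), codec.transform("ex:bob"))

ranks = nx.pagerank(nx_graph)
print({ codec.inverse(node_id): rank for node_id, rank in ranks.items() })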

UML support

Provide support for working with UML diagrams as graphs, e.g., applying inference, shape constraints, probabilistic rules, etc.

These are the approximate steps to follow:

  1. Identify a reasonably small UML diagram to use as a test case.
  2. Scope/Discovery: in what ways are UML diagrams commonly represented?
  3. Reuse a Python library or write new code to parse UML.
  4. At this stage, @ceteri can assist on RDF semantic representation.
  5. Add an import_uml method to KnowledgeGraph.

Shape Prediction: integrate ShapeFactory with RLlib to evolve shapes

To implement shape prediction, integrate ShapeFactory with RLlib to evolve shapes.

To do list:

  1. ShapeFactory: leaderboard with non-dominated columns
    * leaderboard_actor based on https://docs.ray.io/en/master/actors.html

  2. Simplex1: create probabilistic generators from dyad census
    * gamma dist to generate indexed links (depth)
    * predicate co-occurrence (approx. triad census) to generate indexed links (breadth)

  3. EvoShape: represent nodes of internal evolved shape
    * randomize order of node list for potential EvoShape targets
    * walk node list to determine if action blocked
    * action method addNodeDepth() using subj gamma dist
    * action method addNodeBreadth() using ∪(pred co-occur ∩ subj gamma dist)
    * action method submit()
    * explore possible use of node2vec to guide addNodeBreadth() ?

  4. gym-evoshape: subclass Gym environment (see the skeleton after this list)
    * RL observation: action blocked, dist, rank_metric
    * threshold for MIN_INSTANCES => done
    * RL action space: EvoShape methods, where allowed
    * RL reward structure: (neg) steps, (pos) rank_metric
    * visualize subgraph

  5. calculate random baseline of GymKG to generate graph shapes (shape prediction)

  6. adapt for RLlib training with PPO
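
A minimal skeleton for item 4 above, using the classic Gym API; the spaces, rewards, and termination rule here are illustrative assumptions, not a working design:

import gym
import numpy as np
from gym import spaces

class EvoShapeEnv(gym.Env):
    """toy environment whose actions grow an evolved shape"""
    def __init__(self):
        super().__init__()
        # actions: addNodeDepth, addNodeBreadth, submit
        self.action_space = spaces.Discrete(3)
        # observation: action blocked, dist, rank_metric
        self.observation_space = spaces.Box(0.0, 1.0, shape=(3,), dtype=np.float32)

    def reset(self):
        return np.zeros(3, dtype=np.float32)

    def step(self, action):
        obs = np.zeros(3, dtype=np.float32)
        reward = -1.0              # (neg) cost per step; (pos) rank_metric on success
        done = bool(action == 2)   # "submit" ends the episode
        return obs, reward, done, {}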

Import GraphML format

Import files in GraphML format:

  • integrate with pygraphml to import GraphML files (a hedged workaround sketch follows below)
  • use kglab extensions for Parquet serialization and custom memory store to accommodate property graph data

Depends on: #38 #37 #44
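
Until that lands, here's a hedged workaround sketch (not a kglab feature): read GraphML with networkx, then add each edge to a KnowledgeGraph as an RDF triple. The file name and predicate URI are placeholders:

import kglab
import networkx as nx
import rdflib

nx_graph = nx.read_graphml("example.graphml")   # placeholder input file

kg = kglab.KnowledgeGraph(base_uri="https://example.org/")
edge_pred = rdflib.URIRef("https://example.org/connectedTo")  # illustrative predicate

for src, dst in nx_graph.edges():
    kg.add(
        rdflib.URIRef("https://example.org/" + str(src)),
        edge_pred,
        rdflib.URIRef("https://example.org/" + str(dst)),
    )

Note this discards node/edge properties; the Parquet and custom memory store extensions would be needed to round-trip property graph data.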

AttributeError: module 'kglab' has no attribute 'Subgraph'

Working through some of the tutorials and finding problems:

subgraph = kglab.Subgraph(kg)
AttributeError: module 'kglab' has no attribute 'Subgraph'

and

measure = kglab.Measure()
AttributeError: module 'kglab' has no attribute 'Measure'

Windows 10, Python 3.8.3 (64 bit), numpy==1.19.0

KGE and GNN - support and examples

Integration with Graph DBs

Provide integration paths for popular graph databases.
Based on our user survey, the priorities are:

  • neo4j
  • DataStax
  • Ontotext
  • etc.

APIs SubgraphTensor SubgraphMatrix etc referenced in examples not available in latest pip install

kglab.Subgraph exists, but not kglab.SubgraphTensor, which is required in some examples.
The same is true of some other APIs.
Could someone advise on how to move forward?
Thanks!

AttributeError                            Traceback (most recent call last)
in <module>
     14 }
     15
---> 16 subgraph = kglab.SubgraphTensor(kg)
     17 pyvis_graph = subgraph.build_pyvis_graph(style=VIS_STYLE)
     18

AttributeError: module 'kglab' has no attribute 'SubgraphTensor'

Courseware on Manning with GPU accelerator support

Extend the courseware with potential support from BlazingSQL plus NVIDIA credits.

Depends on: #31 #32 #35

  • Work with Manning to have liveProjects run selected notebooks on BlazingSQL platform
  • Work with NVIDIA to get credits for select learners who are nearing completion of their capstone project and require more substantial compute resources

directions for RDBMS support in general

Following up after some work with the Trino authors yesterday: there are needs ahead for better metadata modeling based on inference techniques, semantic technologies, etc., for example in Iceberg connectors.

Just found this about Morph-RDB:

Probably a good thing to have a spike toward: how to integrate with Trino and Morph-RDB. Will check with @dachafra, et al.

Could be good to discuss this with @dvsrepo and Asun, too? It could become an integration point for Recognai.

NetworkX shape passed value error and how can I help?

@ceteri
Paco,
I have some time to spend working with schema.org-based data from Hydroshare, exploring it using kglab. I'm having issues applying it, and I hope that in resolving them I might be able to help somehow with the docs and such.

Hopefully this isn't just me being stupid in graph space, but is of some help back to the project. Happy to share.

Working with the same data from Issue 24, and now trying the NetworkX area, I got a specific error.

So this code:

import networkx as nx

sparql3 = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?subject ?object
WHERE { 
  ?subject a <https://schema.org/Dataset> .
  ?subject <https://schema.org/creator> ?creator .
  ?creator rdf:first ?o .
  ?o <https://schema.org/name> ?object
}
  """

subgraph = kglab.SubgraphMatrix(kg, sparql3)
nx_graph = subgraph.build_nx_graph(nx.DiGraph(), bipartite=True)

results in this error:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_3293779/2456468873.py in <module>
     13 
     14 subgraph = kglab.SubgraphMatrix(kg, sparql3)
---> 15 nx_graph = subgraph.build_nx_graph(nx.DiGraph(), bipartite=True)

~/.conda/envs/kglab/lib/python3.8/site-packages/kglab/subg.py in build_nx_graph(self, nx_graph, bipartite)
    250         """
    251         if self.kg.use_gpus:
--> 252             df = self.build_df()
    253             nx_graph.from_cudf_edgelist(df, source="src", destination="dst")
    254         else:

~/.conda/envs/kglab/lib/python3.8/site-packages/kglab/subg.py in build_df(self, show_symbols)
    223 
    224         if self.kg.use_gpus:
--> 225             df = cudf.DataFrame(rows_list, columns=col_names)
    226         else:
    227             df = pd.DataFrame(rows_list, columns=col_names)

~/.conda/envs/kglab/lib/python3.8/contextlib.py in inner(*args, **kwds)
     73         def inner(*args, **kwds):
     74             with self._recreate_cm():
---> 75                 return func(*args, **kwds)
     76         return inner
     77 

~/.conda/envs/kglab/lib/python3.8/site-packages/cudf/core/dataframe.py in __init__(self, data, index, columns, dtype)
    257                     )
    258                 else:
--> 259                     self._init_from_list_like(
    260                         data, index=index, columns=columns
    261                     )

~/.conda/envs/kglab/lib/python3.8/site-packages/cudf/core/dataframe.py in _init_from_list_like(self, data, index, columns)
    397         if columns is not None:
    398             if len(columns) != len(data):
--> 399                 raise ValueError(
    400                     f"Shape of passed values is ({len(index)}, {len(data)}), "
    401                     f"indices imply ({len(index)}, {len(columns)})."

ValueError: Shape of passed values is (5293, 5293), indices imply (5293, 2).

The results of that SPARQL query on the graph should look like:

subject,object
https://www.hydroshare.org/resource/aefabd0a6d7d47ebaa32e2fb293c9f8a#schemaorg,Courtney G Flint
https://www.hydroshare.org/resource/f94ac7f8d8a048cdbd2610dfa7cd315b#schemaorg,Zhiyu (Drew) Li
https://www.hydroshare.org/resource/f9a75c0b289649aa844e84c24f9f5780#schemaorg,Young-Don Choi
https://www.hydroshare.org/resource/173875a936f14c22a5ba19c721adfb86#schemaorg,Remi Dupas
https://www.hydroshare.org/resource/f1116211202a4c069919797272023e62#schemaorg,Nathan Swain
https://www.hydroshare.org/resource/6d80e4bd00244b5dabaff34074cd3102#schemaorg,Garrick Stephenson
https://www.hydroshare.org/resource/25133b13a1fc4fca9187c2d4e272d4e8#schemaorg,Jessie Myers
https://www.hydroshare.org/resource/ca0f2f0f28ba40018ae64b973e2bb35a#schemaorg,Ruth B. MacNeille
https://doi.org/10.4211/hs.88454dae8c604009b684bfa136e5f7f4#schemaorg,Celray James CHAWANDA
https://doi.org/10.4211/hs.1c6034be6886412ba59970ab1157fa7e#schemaorg,Bethany Neilson
for 5293 lines

Courseware on Manning for KGs

Develop courseware for kglab based on Manning liveProject:

Depends on: #32

  • reuse the example notebooks, adding better structure for progressive exercises
  • learners progress toward a capstone project
    • compete on a leaderboard to achieve coverage of semantic tagging for the full recipe dataset
    • points for correctness, coverage, and performance

Questions:

  • Do we need to approach Food.com to partner in promotion?

Visualization with UMAP

Extend the recipes tutorial to illustrate use of UMAP and HDBSCAN based on 250K recipes x 14K ingredients
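
A minimal sketch of what that tutorial step might look like, with random data standing in for the recipe × ingredient matrix (requires the umap-learn and hdbscan packages; all numbers are illustrative):

import numpy as np
import umap
import hdbscan

rng = np.random.default_rng(42)
X = rng.random((1000, 50))   # toy stand-in for recipe feature vectors

# reduce to 2D with UMAP, then cluster the embedding with HDBSCAN
embedding = umap.UMAP(n_components=2).fit_transform(X)
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(embedding)
print(labels[:10])   # -1 marks noise points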

Error in installation

python 3.7
OS: Ubuntu 18.04
Not sure if this is related to the gcc version.

Building igraph...
    make  all-recursive
    make[1]: Entering directory '/tmp/pip-build-c1_k87ni/python-igraph/vendor/build/igraph'
    Making all in src
    make[2]: Entering directory '/tmp/pip-build-c1_k87ni/python-igraph/vendor/build/igraph/src'
      YACC     foreign-ncol-parser.c
    ../../../source/igraph/ylwrap: line 176: yacc: command not found
    Makefile:9045: recipe for target 'foreign-ncol-parser.c' failed
    make[2]: *** [foreign-ncol-parser.c] Error 127
    make[2]: Leaving directory '/tmp/pip-build-c1_k87ni/python-igraph/vendor/build/igraph/src'
    Makefile:497: recipe for target 'all-recursive' failed
    make[1]: *** [all-recursive] Error 1
    make[1]: Leaving directory '/tmp/pip-build-c1_k87ni/python-igraph/vendor/build/igraph'
    Makefile:404: recipe for target 'all' failed
    make: *** [all] Error 2
    Could not compile the C core of igraph.


    ----------------------------------------
Command "/venv/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-c1_k87ni/python-igraph/setup.py';f=getattr(tokenize, 'open', open)(__file__);c
ode=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-8b7gbk46-record/install-record.txt --single-version-externally-managed --compile --install-headers
 ./venv/include/site/python3.7/python-igraph" failed with error code 1 in /tmp/pip-build-c1_k87ni/python-igraph/

Extend support for Parquet


Depends on: #37

  • add custom metadata (see the sketch below)
  • support N-Quads format by default
  • provide for an optional certainty value (extend N-Quads)
  • provide for an optional property value (extend N-Quads)
  • coordinate with @jake-aft, @dmoore247

Support for property values will assist integration with graph database frameworks such as DataStax, neo4j, etc.,
as well as leading toward support for future implementations of RDF*.
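
A hedged sketch of the custom metadata and extended N-Quads items, using pyarrow directly (column names and metadata keys are illustrative, not a finalized schema):

import pyarrow as pa
import pyarrow.parquet as pq

# one row per quad, with optional extension columns
table = pa.table({
    "s": ["ex:alice"], "p": ["foaf:knows"], "o": ["ex:bob"],
    "g": ["ex:graph1"],      # named graph, i.e. the fourth element of an N-Quad
    "certainty": [0.9],      # optional extension: certainty value
})

# attach custom key/value metadata at the Parquet schema level
table = table.replace_schema_metadata({b"kglab.version": b"0.x"})
pq.write_table(table, "triples.parquet")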

Custom memory store

Develop a custom store for kglab based on the rdflib Memory store plugin (a minimal store skeleton is sketched after this list):

  • future support for extensions to N-Quads based on Parquet
  • future support for distributed stores (sharding) based on Ray
  • enhanced use of GPU hardware accelerators based on RAPIDS
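
A minimal skeleton of a custom rdflib store, following the Store plugin interface; the class and its set-of-triples backing are illustrative, and a real implementation would add Parquet persistence, sharding, etc.:

from rdflib import Graph, URIRef
from rdflib.store import Store

class TinyStore(Store):
    """toy in-memory store backed by a Python set of triples"""
    def __init__(self, configuration=None, identifier=None):
        super().__init__(configuration)
        self._triples = set()

    def add(self, triple, context, quoted=False):
        self._triples.add(triple)

    def remove(self, triple_pattern, context=None):
        self._triples -= set(self.match(triple_pattern))

    def triples(self, triple_pattern, context=None):
        for t in self.match(triple_pattern):
            yield t, iter([])   # rdflib expects (triple, contexts) pairs

    def match(self, pattern):
        # wildcard (None) matching against s, p, o
        s, p, o = pattern
        return [
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ]

    def __len__(self, context=None):
        return len(self._triples)

g = Graph(store=TinyStore())
g.add((URIRef("ex:a"), URIRef("ex:b"), URIRef("ex:c")))
print(len(g))   # 1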

Gotcha: numpy version conflict if installing in existing environment with tensorflow 2.4.1

Problem:
When installing kglab using pip in an existing (activated) environment, the latest version of numpy is installed (because requirements.txt includes 'numpy >= 1.19.4'). This may create conflicts with other packages.

Specific Case: latest numpy version and tensorflow 2.4.1 version conflict:
My activated env contains tensorflow 2.4.1.
Near the end of the installation process from pip install kglab, I got this error message (abbreviated):

[...]
Installing collected packages: 
[...], kglab
  Attempting uninstall: numpy
    Found existing installation: numpy 1.19.2
    Uninstalling numpy-1.19.2:
      Successfully uninstalled numpy-1.19.2
**ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
  tensorflow 2.4.1 requires numpy~=1.19.2, but you have numpy 1.20.2 which is incompatible.**
Successfully installed [all needed]

My fix:

  1. pip uninstall numpy
  2. pip install numpy==1.19.4

My (minimal) tests:

Suggestion/Question:
Perhaps changing the numpy requirement from 'numpy >= 1.19.4' to 'numpy == 1.19.4' would force pip to install this first compatible version instead of the latest?

Integration with rdflib-sqlalchemy, oxrdflib

Perhaps this is a Google Colab-only issue?

!pip install rdflib-sqlalchemy
import kglab
from rdflib import plugin, Graph, Literal, URIRef
from rdflib.store import Store
store = plugin.get("SQLAlchemy", Store)(identifier=URIRef("rdflib_test"))
graph = Graph(store)
graph.open(Literal("sqlite://"), create=True)
kg = kglab.KnowledgeGraph(
  name = "...",
  import_graph = graph
)

AttributeError Traceback (most recent call last)

in ()
---> 50 import_graph = graph
51 )

3 frames

/content/gdrive/MyDrive/ONR/kglab/kglab/kglab.py in __init__(self, name, base_uri, language, use_gpus, import_graph, namespaces)
111
112 # import relations from another RDF graph, or start from blank
--> 113 if import_graph:
114 self._g = import_graph
115 else:

/usr/local/lib/python3.7/dist-packages/rdflib/graph.py in __len__(self)
527 return 1
528
--> 529 def __eq__(self, other):
530 return isinstance(other, Graph) and self.identifier == other.identifier
531

/usr/local/lib/python3.7/dist-packages/rdflib_sqlalchemy/store.py in __len__(self, context)
205 (literal, literalContext,
206 ASSERTED_LITERAL_PARTITION), ]
--> 207 q = union_select(selects, distinct=True, select_type=COUNT_SELECT)
208 else:
209 selects = [

/usr/local/lib/python3.7/dist-packages/rdflib_sqlalchemy/sql.py in union_select(select_components, distinct, select_type)
54
55 if select_type == COUNT_SELECT:
---> 56 select_clause = table.count(whereClause)
57 elif select_type == CONTEXT_SELECT:
58 select_clause = expression.select([table.c.context], whereClause)

AttributeError: 'Alias' object has no attribute 'count'

And oxrdflib:

!pip install oxrdflib
import kglab
import rdflib
kg = kglab.KnowledgeGraph(
  name = "...",
  import_graph = rdflib.Graph(store="OxMemory")
)

ModuleNotFoundError Traceback (most recent call last)

in ()
---> 49 import_graph = rdflib.Graph(store="OxMemory")
51 )

3 frames

/content/gdrive/MyDrive/ONR/kglab/kglab/kglab.py in __init__(self, name, base_uri, language, use_gpus, import_graph, namespaces)
114 self._g = import_graph
115 else:
--> 116 self._g = rdflib.Graph()
117
118 # initialize the namespaces

/usr/local/lib/python3.7/dist-packages/rdflib/graph.py in __init__(self, store, identifier, namespace_manager, base)
325 if self.__namespace_manager is None:
326 self.__namespace_manager = NamespaceManager(self)
--> 327 return self.__namespace_manager
328
329 def _set_namespace_manager(self, nm):

/usr/local/lib/python3.7/dist-packages/rdflib/plugin.py in get(name, kind)
108
109
--> 110 try:
111 from pkg_resources import iter_entry_points
112 except ImportError:

/usr/local/lib/python3.7/dist-packages/rdflib/plugin.py in getClass(self)
69 module = __import__(self.module_path, globals(), locals(), [""])
70 self._class = getattr(module, self.class_name)
---> 71 return self._class
72
73

ModuleNotFoundError: No module named 'rdflib.plugins.stores.memory'

load_parquet slow for kg with 200k nodes

Very nice library!

Just exploring a little and noticed that load_parquet seems to hang when loading from a saved parquet file. At least, it's taking a lot longer to read the kg from file than it did to create the original kg: while it takes 2 minutes to generate the kg from a CSV (kg.add(...)), it's taking over 15 minutes to load the file and appears to be hanging. Any ideas?

The parquet file is ~9MB, and the kg has 200k nodes and 4 Literal relations per node.

The code to load the file is:

kg2 = kglab.KnowledgeGraph(
  name = "...",
  base_uri = "/ex/",
  namespaces = {
    'sosa': 'http://www.w3.org/ns/sosa/'
  },
)
import time
t0 = time.time()
kg2.load_parquet('kg.parquet')
print('Read time: {}s'.format(round((time.time() - t0), 2)))
measure = kglab.Measure()
measure.measure_graph(kg2)
print("edges", measure.get_edge_count())
print("nodes", measure.get_node_count())
# edges 1018040
# nodes 203609

Replace GPUtil with pynvml

Two concerns with GPUtil that have lots of impact on cloud-based use cases and our GPU use in general:

  • @ChuckNoelke reported that the GPUtil check was causing exceptions when Siemens used kglab on AWS SageMaker.
  • Similarly, kglab testing ran into issues with GPUtil when testing GPU support on Ubuntu. The @kfocus team kindly provided a quick workaround and is developing a better Debian package to resolve this. Thanks @mmikowski!

@ray-project was using gpustat, which in practice is buggy and is more oriented toward CLI use for administering a cluster. Probably not wise to use in a library.

Admittedly, NVML changes frequently and keeping the Python bindings up-to-date is a hard problem. NVIDIA recommended that our project use pynvml instead. Thanks @kingmesal!
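
A hedged sketch of that check, using pynvml calls that exist in the library today (the fallback behavior shown is an assumption about how kglab might use it, not the project's actual code):

import pynvml

def gpu_count():
    try:
        pynvml.nvmlInit()
        count = pynvml.nvmlDeviceGetCount()
        pynvml.nvmlShutdown()
        return count
    except pynvml.NVMLError:
        return 0   # no NVIDIA driver or no GPU: use CPU code paths

use_gpus = gpu_count() > 0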

Will follow up with @ellisonbg and others at AWS about how we can handle better integration and support with AWS SageMaker and AWS Neptune.

Discuss graph sizes

Discuss at an early point within the sequence of notebooks whether this is merely a "trivial" example or not.

  • the dataset provides a simple example, leading up to ~250K recipes for a non-trivial graph
  • the medium-sized graph has ~200 nodes
    • large enough for examples to illustrate common use cases
    • small enough for learners to understand
  • NB: any graph with over 1K elements can be non-trivial
  • 10^6 or more elements are needed for deep learning
  • 10^8 can run fine on contemporary laptops without hardware accelerators or cloud-based clusters

Extend serialization to allow URLs, cloud storage, streams, etc.

Extend the serialization methods to allow URLs, cloud storage, streams, etc.:

  • RDFlib methods require a string/bytes source, a file path (as str), or a URL (as str)
    • trap URLpath instances and render full-path URL
    • could buffer streams as needed?
  • Parquet methods should be smart enough to handle any available IO that's supported by pyarrow (see the sketch below)
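
A small sketch of that last point, using plain pyarrow rather than kglab's API: pyarrow already round-trips Parquet through any file-like object, so the serialization methods could accept streams as-is:

import io
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"s": ["ex:a"], "p": ["ex:b"], "o": ["ex:c"]})

buf = io.BytesIO()            # an in-memory stream instead of a file path
pq.write_table(table, buf)
buf.seek(0)

assert pq.read_table(buf).equals(table)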

access anonymous / public AWS S3 object

With dask I can do

df = dd.read_parquet('s3://bucket/key', storage_options={'anon': True})

and it will work for a public bucket / object on AWS S3

Trying

kg = kglab.KnowledgeGraph()

kg.load_parquet('s3://bucket/key', storage_options={'anon': True})

returns: NoCredentialsError: Unable to locate credentials

Curious what the way is to pass the anon=True credentials.
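
As a hedged workaround sketch until kglab forwards storage options: pandas (>= 1.2) passes storage_options through to s3fs, so the public object can be fetched first and then loaded into the graph (bucket/key are placeholders):

import pandas as pd

df = pd.read_parquet("s3://bucket/key", storage_options={"anon": True})
# ... then build the graph from the DataFrame rows, e.g. via kg.add()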
