derwenai / kglab

Graph Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, NetworkX, RAPIDS, RDFlib, pySHACL, PyVis, morph-kgc, pslpython, pyarrow, etc.

Home Page: https://derwen.ai/docs/kgl/

License: MIT License

Languages: Jupyter Notebook 59.57%, HTML 29.44%, Python 10.74%, Dockerfile 0.19%, Shell 0.03%, Ruby 0.02%, Makefile 0.01%
Topics: knowledge-graph, rdflib, networkx, skos, parquet, graph-algorithms, pyvis, json-ld, sparql, shacl

kglab's Introduction

kglab


Welcome to Graph Data Science: https://derwen.ai/docs/kgl/

The kglab library provides a simple abstraction layer in Python 3.7+ for building knowledge graphs, leveraging Pandas, NetworkX, RAPIDS, RDFLib, Morph-KGC, pslpython, and many more.

SPECIAL REQUEST:
Which features would you like in an open source Python library for building knowledge graphs?
Please add your suggestions through this survey:
https://forms.gle/FMHgtmxHYWocprMn6
This will help us prioritize the kglab roadmap.

Reviews

@kaaloo:

"Feels like it's a Hugging Face for graphs! 🤯"

Getting Started

See the "Getting Started" section of the online documentation.

Using kglab as a library for your Python project

We recommend installing from PyPI or conda:

pip

python3 -m pip install kglab

pipenv

pipenv install kglab

conda

conda create -n kglab python=3.7
conda activate kglab
pip install kglab

Or, install from source:

If you work directly from this Git repo, be sure to install the dependencies:

pip

python3 -m pip install -U pip wheel
python3 -m pip install -r requirements.txt

pipenv

pipenv install --dev

Alternatively, to install dependencies using conda:

conda env create -f environment.yml --force
conda activate kglab

Sample Code

Then try some simple uses of this library:

import kglab

# create a KnowledgeGraph object
kg = kglab.KnowledgeGraph()

# load RDF from a URL
kg.load_rdf("http://bigasterisk.com/foaf.rdf", format="xml")

# measure the graph
measure = kglab.Measure()
measure.measure_graph(kg)

print("edges: {}\n".format(measure.get_edge_count()))
print("nodes: {}\n".format(measure.get_node_count()))

# serialize as a string in "Turtle" TTL format
ttl = kg.save_rdf_text()
print(ttl)

See the tutorial notebooks in the examples subdirectory for sample code and patterns to use in integrating kglab with other graph libraries in Python: https://derwen.ai/docs/kgl/tutorial/

WARNING when installing in an existing environment:
Installing a new package in an existing environment may reveal
or create version conflicts. See the kglab requirements
in requirements.txt before you do. For example, there is a
known version conflict between NumPy (>= 1.19.4) and TensorFlow 2+ (which requires numpy ~= 1.19.2).

Using Docker

For a simple approach to running the tutorials, see use of docker compose: https://derwen.ai/docs/kgl/tutorial/#use-docker-compose

Also, container images for each release are available on DockerHub: https://hub.docker.com/repository/docker/derwenai/kglab

To build a container image and run it for the tutorials:

docker build --pull --rm -f "docker/Dockerfile" -t kglab:latest .
docker run -p 8888:8888 -it kglab

To build and run a container image for testing:

docker build --pull --rm -f "docker/testsuite.Dockerfile" -t kglabtest:latest .
docker run --rm -it kglabtest
Build Instructions

Note: unless you are contributing code and updates, in most use cases you won't need to build this package locally.

Instead, simply install from PyPI or use Conda.

To set up the build environment locally, see the "Build Instructions" section of the online documentation.

Semantic Versioning

Before kglab reaches release v1.0.0, the types and classes may undergo substantial changes, and the project is not guaranteed to have a consistent API.

Even so, we'll try to minimize breaking changes. We'll also be sure to provide careful notes.

See: changelog.txt

Contributing Code

We welcome people getting involved as contributors to this open source project!

For detailed instructions please see: CONTRIBUTING.md

License and Copyright

Source code for kglab, plus its logo, documentation, and examples, has an MIT license, which is succinct and simplifies use in commercial applications.

All materials herein are Copyright © 2020-2023 Derwen, Inc.

Attribution

Please use the following BibTeX entry for citing kglab if you use it in your research or software. Citations are helpful for the continued development and maintenance of this library.
@software{kglab,
  author = {Paco Nathan},
  title = {{kglab: a simple abstraction layer in Python for building knowledge graphs}},
  year = 2020,
  publisher = {Derwen},
  doi = {10.5281/zenodo.6360664},
  url = {https://github.com/DerwenAI/kglab}
}

illustration of a knowledge graph, plus laboratory glassware

Kudos

Many thanks to our open source sponsors and contributors: @ceteri, @dvsrepo, @Ankush-Chander, @louisguitton, @tomaarsen, @Mec-iS, @jake-aft, @Tpt, @ArenasGuerreroJulian, @fils, @cutterkom, @RishiKumarRay, @gauravjaglan, @pebbie, @CatChenal, @jorisSchaller, @dmoore247; plus general support from Derwen, Inc.; the Knowledge Graph Conference and Connected Data World, along with the even larger scope of use cases represented by their communities; Kubuntu Focus; the RAPIDS team @ NVIDIA; Gradient Flow; and Manning Publications.


kglab's Issues

return all namespaces

Note that get_ns_dict() only returns the prefix:uri pairs that have been added explicitly to a kglab.KnowledgeGraph object instance.

It should be returning the union of the namespaces referenced by each of the parses, loads, and explicit namespace additions.

Here's example code showing the value the method should return:

import kglab

kg = kglab.KnowledgeGraph()
kg.load_rdf("dat/wtm.ttl", format="ttl")

nm = kg.rdf_graph().namespace_manager

ns = {
    prefix: uri
    for prefix, uri in nm.namespaces()
}

# the fixed get_ns_dict() should return this dict

unsatisfied requirement grave

Documenting here; this was covered in the lab this week:

The package grave is imported by kglab.py but not specified in requirements.txt.
Without pip install grave I received an error (screenshot omitted here).

@ceteri stated the intent was to remove grave.

SPARQL Query Visualization

Hi, Thank you for this nice work!

I am suggesting adding the ability to visualize a SPARQL query (its graph pattern) to help in the tutorials.
I wrote this small library a year ago.
I noticed that kglab uses pyvis, so I refactored the code a bit today so it can be used with pyvis.
This is available in the example notebook.

Cheers

NVIDIA RAPIDS integration

Provide integrations for NVIDIA RAPIDS to accelerate graph-based methods within kglab.

Depends on: #37

  • determine how to switch library dependencies and usage internally within kglab based on GPU availability (see the sketch after this list)
  • cuDF to accelerate usage for pandas, PyArrow, etc.
  • cuGraph to accelerate usage for networkx, igraph, etc.
  • cuML to accelerate usage for UMAP, PyTorch, scikit-learn, etc.
  • cuFilter to accelerate graph visualization methods (add Bokeh ?)
  • ask @kingmesal, et al., for suggestions and help in promotion
  • work with @BlazingDB to develop and test the RAPIDS integration
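
A minimal sketch of the first item above (an assumed pattern, not kglab's actual implementation): switch between cuDF and pandas at import time, depending on whether RAPIDS is available:

# hypothetical GPU/CPU switch: prefer RAPIDS cuDF, fall back to pandas
try:
    import cudf as xdf   # GPU-accelerated DataFrames (RAPIDS)
    USE_GPUS = True
except ImportError:
    import pandas as xdf # CPU fallback with a near-identical API
    USE_GPUS = False

# downstream code can use the same DataFrame calls either way
df = xdf.DataFrame({"src": [0, 1], "dst": [1, 2]})
print(type(df), USE_GPUS)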

Online documentation

Leverage MkDocs and mknotebooks to generate a documentation site in Markdown suitable for interoperating with GitHub Pages

  1. write the main pages in Markdown to describe "How" and "Why" to use kglab for common KG use cases
  2. draw from material in the notebooks to illustrate computable content
  3. parse the generated content to build a sub-site delivered in Flask on https://derwen.ai/

Implement `Subgraph` class

Whenever networkx gets used in kglab, one required step in data preparation is to assign an integer node_id to each node used in the graph algorithm. Often these are subgraphs of the KnowledgeGraph object, not the complete RDF graph, due to filtering based on SPARQL queries, graph traversals, etc.

Also, the subgraph needed by networkx must be simplified: only one edge between each pair of nodes.
We must choose which relations get used to represent edges in our subgraph, and sometimes we need to filter out particular predicates. For example, including rdf:type in a centrality calculation may skew its results.

Other library integrations have similar needs, including pyvis visualizations, pslpython inference, and, looking ahead, many of the libraries for knowledge graph embedding.

Consequently, let's introduce another class called Subgraph which manages translating from an RDF graph to a simplified directed graph that uses numeric identifiers. This will help to package results from networkx and other libraries, such as automatically translating node_id values back to node objects; see the sketch after the feature list below.

Features include:

  • filter predicates to construct a subgraph
  • build a map of unique identifiers (integers) to nodes within the subgraph
  • map node_id to node and vice versa (codec)
  • construct a DiGraph object
  • package results into iterators returning namedtuple
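
A minimal sketch of the proposed codec idea, with illustrative names (NodeCodec is not part of kglab): assign each RDF node an integer node_id, run a networkx algorithm, then translate the ids back to node objects:

import networkx as nx

class NodeCodec:
    """bidirectional map between RDF nodes and integer node_id values"""
    def __init__(self):
        self._node_to_id = {}
        self._id_to_node = []

    def transform(self, node):
        # assign a stable integer id on first sight
        if node not in self._node_to_id:
            self._node_to_id[node] = len(self._id_to_node)
            self._id_to_node.append(node)
        return self._node_to_id[node]

    def inverse(self, node_id):
        # translate an integer id back to its RDF node
        return self._id_to_node[node_id]

codec = NodeCodec()
nx_graph = nx.DiGraph()
nx_graph.add_edge(codec.transform("ex:alice"), codec.transform("ex:bob"))

ranks = nx.pagerank(nx_graph)
print({ codec.inverse(node_id): rank for node_id, rank in ranks.items() })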

UML support

Provide support for working with UML diagrams as graphs, e.g., applying inference, shape constraints, probabilistic rules, etc.

These are the approximate steps to follow:

  1. Identify a reasonably small UML diagram to use as a test case.
  2. Scope/Discovery: in what ways are UML diagrams commonly represented?
  3. Reuse a Python library or write new code to parse UML.
  4. At this stage, @ceteri can assist on RDF semantic representation.
  5. Add an import_uml method to KnowledgeGraph.

Shape Prediction: integrate ShapeFactory with RLlib to evolve shapes

To implement shape prediction, integrate ShapeFactory with RLlib to evolve shapes.

To do list:

  1. ShapeFactory: leaderboard with non-dominated columns
    * leaderboard_actor based on https://docs.ray.io/en/master/actors.html

  2. Simplex1: create probabilistic generators from dyad census
    * gamma dist to generate indexed links (depth)
    * predicate co-occurrence (approx. triad census) to generate indexed links (breadth)

  3. EvoShape: represent nodes of internal evolved shape
    * randomize order of node list for potential EvoShape targets
    * walk node list to determine if action blocked
    * action method addNodeDepth() using subj gamma dist
    * action method addNodeBreadth() using ∪(pred co-occur ∩ subj gamma dist)
    * action method submit()
    * explore possible use of node2vec to guide addNodeBreadth() ?

  4. gym-evoshape: subclass Gym environment (see the skeleton after this list)
    * RL observation: action blocked, dist, rank_metric
    * threshold for MIN_INSTANCES => done
    * RL action space: EvoShape methods, where allowed
    * RL reward structure: (neg) steps, (pos) rank_metric
    * visualize subgraph

  5. calculate random baseline of GymKG to generate graph shapes (shape prediction)

  6. adapt for RLlib training with PPO
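
A minimal skeleton for item 4 above, using the classic Gym API; the spaces, rewards, and termination rule here are illustrative assumptions, not a working design:

import gym
import numpy as np
from gym import spaces

class EvoShapeEnv(gym.Env):
    """toy environment whose actions grow an evolved shape"""
    def __init__(self):
        super().__init__()
        # actions: addNodeDepth, addNodeBreadth, submit
        self.action_space = spaces.Discrete(3)
        # observation: action blocked, dist, rank_metric
        self.observation_space = spaces.Box(0.0, 1.0, shape=(3,), dtype=np.float32)

    def reset(self):
        return np.zeros(3, dtype=np.float32)

    def step(self, action):
        obs = np.zeros(3, dtype=np.float32)
        reward = -1.0              # (neg) cost per step; (pos) rank_metric on success
        done = bool(action == 2)   # "submit" ends the episode
        return obs, reward, done, {}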

Import GraphML format

Import files in GraphML format:

  • integrate with pygraphml to import GraphML files (a hedged workaround sketch follows below)
  • use kglab extensions for Parquet serialization and custom memory store to accommodate property graph data

Depends on: #38 #37 #44
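
Until that lands, here's a hedged workaround sketch (not a kglab feature): read GraphML with networkx, then add each edge to a KnowledgeGraph as an RDF triple. The file name and predicate URI are placeholders:

import kglab
import networkx as nx
import rdflib

nx_graph = nx.read_graphml("example.graphml")   # placeholder input file

kg = kglab.KnowledgeGraph(base_uri="https://example.org/")
edge_pred = rdflib.URIRef("https://example.org/connectedTo")  # illustrative predicate

for src, dst in nx_graph.edges():
    kg.add(
        rdflib.URIRef("https://example.org/" + str(src)),
        edge_pred,
        rdflib.URIRef("https://example.org/" + str(dst)),
    )

Note this discards node/edge properties; the Parquet and custom memory store extensions would be needed to round-trip property graph data.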

AttributeError: module 'kglab' has no attribute 'Subgraph'

Working through some of the tutorials and finding problems:

subgraph = kglab.Subgraph(kg)
AttributeError: module 'kglab' has no attribute 'Subgraph'

and

measure = kglab.Measure()
AttributeError: module 'kglab' has no attribute 'Measure'

Windows 10, Python 3.8.3 (64 bit), numpy==1.19.0

KGE and GNN - support and examples

Integration with Graph DBs

Provide integration paths for popular graph databases.
Based on our user survey, the priorities are:

  • neo4j
  • DataStax
  • Ontotext
  • etc.

APIs SubgraphTensor SubgraphMatrix etc referenced in examples not available in latest pip install

kglab.Subgraph exists, but not kglab.SubgraphTensor, which is required in some examples.
The same is true of some other APIs.
Could someone advise on how to move forward?
Thanks!

AttributeError                            Traceback (most recent call last)
in <module>
     14 }
     15
---> 16 subgraph = kglab.SubgraphTensor(kg)
     17 pyvis_graph = subgraph.build_pyvis_graph(style=VIS_STYLE)
     18

AttributeError: module 'kglab' has no attribute 'SubgraphTensor'

Courseware on Manning with GPU accelerator support

Extend the courseware with potential support from BlazingSQL plus NVIDIA credits.

Depends on: #31 #32 #35

  • Work with Manning to have liveProjects run selected notebooks on BlazingSQL platform
  • Work with NVIDIA to get credits for select learners who are nearing completion of their capstone project and require more substantial compute resources

directions for RDBMS support in general

Following up after some work with the Trino authors yesterday: there are needs ahead for better metadata modeling based on inference techniques, semantic technologies, etc., for example in Iceberg connectors.

Just found this about Morph-RDB:

Probably a good thing to have a spike toward: how to integrate with Trino and Morph-RDB. Will check with @dachafra, et al.

Could be good to discuss this with @dvsrepo and Asun, too? It could become an integration point for Recognai.

NetworkX shape passed value error and how can I help?

@ceteri
Paco,
I have some time to spend working with schema.org-based data from Hydroshare, exploring it using kglab. I'm having issues applying it, and I hope that in resolving them I might be able to help somehow with the docs and such.

Hopefully this isn't just me being stupid in graph space, but is of some help back to the project. Happy to share.

Working with the same data from Issue 24, and now trying the NetworkX area, I got a specific error.

So this code:

import networkx as nx

sparql3 = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?subject ?object
WHERE { 
  ?subject a <https://schema.org/Dataset> .
  ?subject <https://schema.org/creator> ?creator .
  ?creator rdf:first ?o .
  ?o <https://schema.org/name> ?object
}
  """

subgraph = kglab.SubgraphMatrix(kg, sparql3)
nx_graph = subgraph.build_nx_graph(nx.DiGraph(), bipartite=True)

results in this error:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_3293779/2456468873.py in <module>
     13 
     14 subgraph = kglab.SubgraphMatrix(kg, sparql3)
---> 15 nx_graph = subgraph.build_nx_graph(nx.DiGraph(), bipartite=True)

~/.conda/envs/kglab/lib/python3.8/site-packages/kglab/subg.py in build_nx_graph(self, nx_graph, bipartite)
    250         """
    251         if self.kg.use_gpus:
--> 252             df = self.build_df()
    253             nx_graph.from_cudf_edgelist(df, source="src", destination="dst")
    254         else:

~/.conda/envs/kglab/lib/python3.8/site-packages/kglab/subg.py in build_df(self, show_symbols)
    223 
    224         if self.kg.use_gpus:
--> 225             df = cudf.DataFrame(rows_list, columns=col_names)
    226         else:
    227             df = pd.DataFrame(rows_list, columns=col_names)

~/.conda/envs/kglab/lib/python3.8/contextlib.py in inner(*args, **kwds)
     73         def inner(*args, **kwds):
     74             with self._recreate_cm():
---> 75                 return func(*args, **kwds)
     76         return inner
     77 

~/.conda/envs/kglab/lib/python3.8/site-packages/cudf/core/dataframe.py in __init__(self, data, index, columns, dtype)
    257                     )
    258                 else:
--> 259                     self._init_from_list_like(
    260                         data, index=index, columns=columns
    261                     )

~/.conda/envs/kglab/lib/python3.8/site-packages/cudf/core/dataframe.py in _init_from_list_like(self, data, index, columns)
    397         if columns is not None:
    398             if len(columns) != len(data):
--> 399                 raise ValueError(
    400                     f"Shape of passed values is ({len(index)}, {len(data)}), "
    401                     f"indices imply ({len(index)}, {len(columns)})."

ValueError: Shape of passed values is (5293, 5293), indices imply (5293, 2).

The results of that SPARQL query on the graph should look like:

subject,object
https://www.hydroshare.org/resource/aefabd0a6d7d47ebaa32e2fb293c9f8a#schemaorg,Courtney G Flint
https://www.hydroshare.org/resource/f94ac7f8d8a048cdbd2610dfa7cd315b#schemaorg,Zhiyu (Drew) Li
https://www.hydroshare.org/resource/f9a75c0b289649aa844e84c24f9f5780#schemaorg,Young-Don Choi
https://www.hydroshare.org/resource/173875a936f14c22a5ba19c721adfb86#schemaorg,Remi Dupas
https://www.hydroshare.org/resource/f1116211202a4c069919797272023e62#schemaorg,Nathan Swain
https://www.hydroshare.org/resource/6d80e4bd00244b5dabaff34074cd3102#schemaorg,Garrick Stephenson
https://www.hydroshare.org/resource/25133b13a1fc4fca9187c2d4e272d4e8#schemaorg,Jessie Myers
https://www.hydroshare.org/resource/ca0f2f0f28ba40018ae64b973e2bb35a#schemaorg,Ruth B. MacNeille
https://doi.org/10.4211/hs.88454dae8c604009b684bfa136e5f7f4#schemaorg,Celray James CHAWANDA
https://doi.org/10.4211/hs.1c6034be6886412ba59970ab1157fa7e#schemaorg,Bethany Neilson
for 5293 lines

Courseware on Manning for KGs

Develop courseware for kglab based on Manning liveProject:

Depends on: #32

  • reuse the example notebooks, adding better structure for progressive exercises
  • learners progress toward a capstone project
    • compete on a leaderboard to achieve coverage of semantic tagging for the full recipe dataset
    • points for correctness, coverage, and performance

Questions:

  • Do we need to approach Food.com to partner in promotion?

Visualization with UMAP

Extend the recipes tutorial to illustrate use of UMAP and HDBSCAN based on 250K recipes x 14K ingredients
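
A minimal sketch of what that tutorial step might look like, with random data standing in for the recipe × ingredient matrix (requires the umap-learn and hdbscan packages; all numbers are illustrative):

import numpy as np
import umap
import hdbscan

rng = np.random.default_rng(42)
X = rng.random((1000, 50))   # toy stand-in for recipe feature vectors

# reduce to 2D with UMAP, then cluster the embedding with HDBSCAN
embedding = umap.UMAP(n_components=2).fit_transform(X)
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(embedding)
print(labels[:10])   # -1 marks noise points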

Error in installation

python 3.7
OS: Ubuntu 18.04
Not sure if this is related to the gcc version.

Building igraph...
    make  all-recursive
    make[1]: Entering directory '/tmp/pip-build-c1_k87ni/python-igraph/vendor/build/igraph'
    Making all in src
    make[2]: Entering directory '/tmp/pip-build-c1_k87ni/python-igraph/vendor/build/igraph/src'
      YACC     foreign-ncol-parser.c
    ../../../source/igraph/ylwrap: line 176: yacc: command not found
    Makefile:9045: recipe for target 'foreign-ncol-parser.c' failed
    make[2]: *** [foreign-ncol-parser.c] Error 127
    make[2]: Leaving directory '/tmp/pip-build-c1_k87ni/python-igraph/vendor/build/igraph/src'
    Makefile:497: recipe for target 'all-recursive' failed
    make[1]: *** [all-recursive] Error 1
    make[1]: Leaving directory '/tmp/pip-build-c1_k87ni/python-igraph/vendor/build/igraph'
    Makefile:404: recipe for target 'all' failed
    make: *** [all] Error 2
    Could not compile the C core of igraph.


    ----------------------------------------
Command "/venv/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-c1_k87ni/python-igraph/setup.py';f=getattr(tokenize, 'open', open)(__file__);c
ode=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-8b7gbk46-record/install-record.txt --single-version-externally-managed --compile --install-headers
 ./venv/include/site/python3.7/python-igraph" failed with error code 1 in /tmp/pip-build-c1_k87ni/python-igraph/

Extend support for Parquet


Depends on: #37

  • add custom metadata (see the sketch below)
  • support N-Quads format by default
  • provide for an optional certainty value (extend N-Quads)
  • provide for an optional property value (extend N-Quads)
  • coordinate with @jake-aft, @dmoore247

Support for property values will assist integration with graph database frameworks such as DataStax, neo4j, etc.,
as well as leading toward support for future implementations of RDF*.
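
A hedged sketch of the custom metadata and extended N-Quads items, using pyarrow directly (column names and metadata keys are illustrative, not a finalized schema):

import pyarrow as pa
import pyarrow.parquet as pq

# one row per quad, with optional extension columns
table = pa.table({
    "s": ["ex:alice"], "p": ["foaf:knows"], "o": ["ex:bob"],
    "g": ["ex:graph1"],      # named graph, i.e. the fourth element of an N-Quad
    "certainty": [0.9],      # optional extension: certainty value
})

# attach custom key/value metadata at the Parquet schema level
table = table.replace_schema_metadata({b"kglab.version": b"0.x"})
pq.write_table(table, "triples.parquet")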

Custom memory store

Develop a custom store for kglab based on the rdflib Memory store plugin (a minimal store skeleton is sketched after this list):

  • future support for extensions to N-Quads based on Parquet
  • future support for distributed stores (sharding) based on Ray
  • enhanced use of GPU hardware accelerators based on RAPIDS
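
A minimal skeleton of a custom rdflib store, following the Store plugin interface; the class and its set-of-triples backing are illustrative, and a real implementation would add Parquet persistence, sharding, etc.:

from rdflib import Graph, URIRef
from rdflib.store import Store

class TinyStore(Store):
    """toy in-memory store backed by a Python set of triples"""
    def __init__(self, configuration=None, identifier=None):
        super().__init__(configuration)
        self._triples = set()

    def add(self, triple, context, quoted=False):
        self._triples.add(triple)

    def remove(self, triple_pattern, context=None):
        self._triples -= set(self.match(triple_pattern))

    def triples(self, triple_pattern, context=None):
        for t in self.match(triple_pattern):
            yield t, iter([])   # rdflib expects (triple, contexts) pairs

    def match(self, pattern):
        # wildcard (None) matching against s, p, o
        s, p, o = pattern
        return [
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ]

    def __len__(self, context=None):
        return len(self._triples)

g = Graph(store=TinyStore())
g.add((URIRef("ex:a"), URIRef("ex:b"), URIRef("ex:c")))
print(len(g))   # 1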

Gotcha: numpy version conflict if installing in existing environment with tensorflow 2.4.1

Problem:
When installing kglab using pip in an existing (activated) environment, the latest version of numpy is installed (because requirements.txt includes 'numpy >= 1.19.4'). This may create conflicts with other packages.

Specific Case: latest numpy version and tensorflow 2.4.1 version conflict:
My activated env contains tensorflow 2.4.1.
Near the end of the installation process from pip install kglab, I got this error message (abbreviated):

[...]
Installing collected packages: 
[...], kglab
  Attempting uninstall: numpy
    Found existing installation: numpy 1.19.2
    Uninstalling numpy-1.19.2:
      Successfully uninstalled numpy-1.19.2
**ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
  tensorflow 2.4.1 requires numpy~=1.19.2, but you have numpy 1.20.2 which is incompatible.**
Successfully installed [all needed]

My fix:

  1. pip uninstall numpy
  2. pip install numpy==1.19.4

My (minimal) tests:

Suggestion/Question:
Perhaps changing the numpy requirement from 'numpy >= 1.19.4' to 'numpy == 1.19.4' would force pip to install this first compatible version instead of the latest?

Integration with rdflib-sqlalchemy, oxrdflib

Perhaps this is a Google Colab-only issue?

!pip install rdflib-sqlalchemy
import kglab
from rdflib import plugin, Graph, Literal, URIRef
from rdflib.store import Store
store = plugin.get("SQLAlchemy", Store)(identifier=URIRef("rdflib_test"))
graph = Graph(store)
graph.open(Literal("sqlite://"), create=True)
kg = kglab.KnowledgeGraph(
  name = "...",
  import_graph = graph
)

AttributeError Traceback (most recent call last)

in ()
---> 50 import_graph = graph
51 )

3 frames

/content/gdrive/MyDrive/ONR/kglab/kglab/kglab.py in __init__(self, name, base_uri, language, use_gpus, import_graph, namespaces)
111
112 # import relations from another RDF graph, or start from blank
--> 113 if import_graph:
114 self._g = import_graph
115 else:

/usr/local/lib/python3.7/dist-packages/rdflib/graph.py in __len__(self)
527 return 1
528
--> 529 def __eq__(self, other):
530 return isinstance(other, Graph) and self.identifier == other.identifier
531

/usr/local/lib/python3.7/dist-packages/rdflib_sqlalchemy/store.py in __len__(self, context)
205 (literal, literalContext,
206 ASSERTED_LITERAL_PARTITION), ]
--> 207 q = union_select(selects, distinct=True, select_type=COUNT_SELECT)
208 else:
209 selects = [

/usr/local/lib/python3.7/dist-packages/rdflib_sqlalchemy/sql.py in union_select(select_components, distinct, select_type)
54
55 if select_type == COUNT_SELECT:
---> 56 select_clause = table.count(whereClause)
57 elif select_type == CONTEXT_SELECT:
58 select_clause = expression.select([table.c.context], whereClause)

AttributeError: 'Alias' object has no attribute 'count'

And oxrdflib:

!pip install oxrdflib
import kglab
import rdflib
kg = kglab.KnowledgeGraph(
  name = "...",
  import_graph = rdflib.Graph(store="OxMemory")
)

ModuleNotFoundError Traceback (most recent call last)

in ()
---> 49 import_graph = rdflib.Graph(store="OxMemory")
51 )

3 frames

/content/gdrive/MyDrive/ONR/kglab/kglab/kglab.py in __init__(self, name, base_uri, language, use_gpus, import_graph, namespaces)
114 self._g = import_graph
115 else:
--> 116 self._g = rdflib.Graph()
117
118 # initialize the namespaces

/usr/local/lib/python3.7/dist-packages/rdflib/graph.py in __init__(self, store, identifier, namespace_manager, base)
325 if self.__namespace_manager is None:
326 self.__namespace_manager = NamespaceManager(self)
--> 327 return self.__namespace_manager
328
329 def _set_namespace_manager(self, nm):

/usr/local/lib/python3.7/dist-packages/rdflib/plugin.py in get(name, kind)
108
109
--> 110 try:
111 from pkg_resources import iter_entry_points
112 except ImportError:

/usr/local/lib/python3.7/dist-packages/rdflib/plugin.py in getClass(self)
69 module = __import__(self.module_path, globals(), locals(), [""])
70 self._class = getattr(module, self.class_name)
---> 71 return self._class
72
73

ModuleNotFoundError: No module named 'rdflib.plugins.stores.memory'

load_parquet slow for kg with 200k nodes

Very nice library!

Just exploring a little and noticed that load_parquet seems to hang when loading from a saved parquet file. At least, it's taking a lot longer to read the kg from file than it did to create the original kg: while it takes 2 minutes to generate the kg from a CSV (kg.add(...)), it's taking over 15 minutes to load the file and appears to be hanging. Any ideas?

The parquet file is ~9MB, and the kg has 200k nodes and 4 Literal relations per node.

The code to load the file is:

kg2 = kglab.KnowledgeGraph(
  name = "...",
  base_uri = "/ex/",
  namespaces = {
    'sosa': 'http://www.w3.org/ns/sosa/'
  },
)
import time
t0 = time.time()
kg2.load_parquet('kg.parquet')
print('Read time: {}s'.format(round((time.time() - t0), 2)))
measure = kglab.Measure()
measure.measure_graph(kg2)
print("edges", measure.get_edge_count())
print("nodes", measure.get_node_count())
# edges 1018040
# nodes 203609

Replace GPUtil with pynvml

Two concerns with GPUtil that have lots of impact on cloud-based use cases and our GPU use in general:

  • @ChuckNoelke reported that the GPUtil check was causing exceptions when Siemens used kglab on AWS SageMaker.
  • Similarly, kglab testing ran into issues with GPUtil when testing GPU support on Ubuntu. The @kfocus team kindly provided a quick workaround and is developing a better Debian package to resolve this. Thanks @mmikowski!

@ray-project was using gpustat, which in practice is buggy and is more oriented toward CLI use for administering a cluster. Probably not wise to use in a library.

Admittedly, NVML changes frequently and keeping the Python bindings up-to-date is a hard problem. NVIDIA recommended that our project use pynvml instead. Thanks @kingmesal!
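
A hedged sketch of that check, using pynvml calls that exist in the library today (the fallback behavior shown is an assumption about how kglab might use it, not the project's actual code):

import pynvml

def gpu_count():
    try:
        pynvml.nvmlInit()
        count = pynvml.nvmlDeviceGetCount()
        pynvml.nvmlShutdown()
        return count
    except pynvml.NVMLError:
        return 0   # no NVIDIA driver or no GPU: use CPU code paths

use_gpus = gpu_count() > 0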

Will follow up with @ellisonbg and others at AWS about how we can handle better integration and support with AWS SageMaker and AWS Neptune.

Discuss graph sizes

Discuss at an early point within the sequence of notebooks whether this is merely a "trivial" example or not.

  • the dataset provides a simple example, leading up to ~250K recipes for a non-trivial graph
  • the medium-sized graph has ~200 nodes
    • large enough for examples to illustrate common use cases
    • small enough for learners to understand
  • NB: any graph with over 1K elements can be non-trivial
  • 10^6 or more elements are needed for deep learning
  • 10^8 can run fine on contemporary laptops without hardware accelerators or cloud-based clusters

Extend serialization to allow URLs, cloud storage, streams, etc.

Extend the serialization methods to allow URLs, cloud storage, streams, etc.:

  • RDFlib methods require a string/bytes source, a file path (as str), or a URL (as str)
    • trap URLpath instances and render full-path URL
    • could buffer streams as needed?
  • Parquet methods should be smart enough to handle any available IO that's supported by pyarrow (see the sketch below)
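
A small sketch of that last point, using plain pyarrow rather than kglab's API: pyarrow already round-trips Parquet through any file-like object, so the serialization methods could accept streams as-is:

import io
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"s": ["ex:a"], "p": ["ex:b"], "o": ["ex:c"]})

buf = io.BytesIO()            # an in-memory stream instead of a file path
pq.write_table(table, buf)
buf.seek(0)

assert pq.read_table(buf).equals(table)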

access anonymous / public AWS S3 object

With dask I can do

df = dd.read_parquet('s3://bucket/key', storage_options={'anon': True})

and it will work for a public bucket / object on AWS S3

Trying

kg = kglab.KnowledgeGraph()

kg.load_parquet('s3://bucket/key', storage_options={'anon': True})

returns: NoCredentialsError: Unable to locate credentials

Curious what the way is to pass the anon=True credentials.
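
As a hedged workaround sketch until kglab forwards storage options: pandas (>= 1.2) passes storage_options through to s3fs, so the public object can be fetched first and then loaded into the graph (bucket/key are placeholders):

import pandas as pd

df = pd.read_parquet("s3://bucket/key", storage_options={"anon": True})
# ... then build the graph from the DataFrame rows, e.g. via kg.add()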
