metagraph-dev / metagraph
Multi-target API for graph analytics with Dask
Home Page: https://metagraph.readthedocs.io/en/latest/
License: Apache License 2.0
Thinking about the plugin architecture is making me want a versioned approach to the API, similar to how things are done for a REST API. With REST APIs, novice implementers write something like /item/1/update/, but experienced authors write /v1/item/1/update/ because they know that their API will evolve over time and they want a way to move forward without breaking the existing users of their REST API.
In a similar vein, we will choose types, properties, and abstract algorithms that plugins will implement. If we ever want to change those types, properties, and algorithm signatures or paths, we don't want to break all existing plugins. Let's think through the potential approaches to take.
If we decide that our types are wrong, it should be easy enough to create new types. We will lose the ability to reuse a name (i.e. Vector) to mean something different, but that's probably okay. The likelihood is that we will introduce new types (BipartiteGraph, MultiGraph, NeuralNetwork) rather than wanting to change an existing type.
I have been thinking that I want to eliminate the hierarchy of abstract properties and instead make properties dependent on each other. For example, a Graph is_weighted or not. If is_weighted=True, then you can have a strictly_positive property. Similarly, for a Matrix the is_symmetric property only applies to a Matrix that is_square. There is a hierarchy of properties depending on each other.
For a given property, there still is clearly a direction of narrowness or specificity, as I've discussed before. Going from narrow to general is easy. However, there is often a standard way to go in the other direction as well. Weights are simply dropped, directed edges ignore their direction, etc. As such, we could ignore the directionality of properties.
However, having a concept of general vs narrow settings for a property is nice when it comes to adding a new property to existing implementations. For example, if Graphs today have only properties is_directed and is_weighted, and we later add a property has_self_edges, all existing implementations of Graphs become ill-defined. They don't know about the new property and therefore cannot tell you if they have self-edges or not. If we maintain the concept of general vs narrow settings, then an undefined property is always set to its most general setting. This means that introducing a new property will keep all existing implementations fully defined, but restricted to the most general setting.
For the example of self-loops, having self-loops is the general case while not having self-loops is the narrow case. So we can assume all existing Graph types have self-loops, and slowly upgrade plugins with a knowledge of self-loop determination without breaking the whole ecosystem.
This ability to add new properties in a clean manner seems reason enough to keep the general vs narrow concept of property settings.
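To make the idea concrete, here is a minimal sketch (the names and structure are illustrative, not metagraph's actual property machinery) of how an undefined property could default to its most general setting:

import pprint

GRAPH_PROPERTIES = {
    "is_directed":    {"values": [True, False], "general": True},
    "is_weighted":    {"values": [True, False], "general": True},
    "has_self_edges": {"values": [True, False], "general": True},  # newly added property
}

def effective_properties(declared):
    """Fill in any property the implementation didn't declare with its most general setting."""
    return {name: declared.get(name, spec["general"]) for name, spec in GRAPH_PROPERTIES.items()}

# An older plugin written before has_self_edges existed stays fully defined:
pprint.pprint(effective_properties({"is_directed": False, "is_weighted": True}))
# {'has_self_edges': True, 'is_directed': False, 'is_weighted': True}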
Changing an abstract algorithm's signature is dangerous because it will no longer match existing concrete algorithm signatures. We could change the name of the algorithm, but we could end up in a case with sssp, sssp2, shortest_path, and ss_shortest_path, making it very confusing for a user to know which is the latest and greatest. Plenty of existing libraries already suffer from this condition. They deprecate functions, but cannot remove them for fear of breaking existing code.
We likely need some type of versioning with abstract algorithms tagged against v1, v2, etc. We will want to keep older versions around to avoid breaking existing user code and to avoid eliminating existing plugin implementations. Then we slowly deprecate older versions or find a way to steer new user code to use the newer versions.
I'm not sure how to do this without making all user code look like r.algo.v1.link_analysis.pagerank(G) and r.algo.v2.rankings.google_pagerank(G). Anyone have good ideas?
In a similar vein, I like the categorization of abstract algorithms, but those categories might shift over time as the number of algorithms grows. In the ideal world, a user would be able to search for an algorithm in a notebook using a hierarchy or tags or other methods, and then call the algorithm independent of the mechanism used to find the algorithm name.
Something like r.algo.link_analysis.<tab> would show that pagerank is a valid algorithm, but the user would actually write code like r.algo.pagerank to keep working code independent of the categories we choose.
Using a version approach (r.algo1, r.algo2) could handle this, as each new version could change the categorization without breaking the previous version's style. I don't love that solution, but it would work. Any other ideas?
When writing unit tests, we will often have a need to check for equality or near-equality (within a tolerance) of two objects.
What is the best approach to achieving this? I'm leaning towards the 2nd approach.
1. Each test knows what the expected result will be and can perform whatever checks it wants. If some types have an equality method, the test can call that, but there is no expectation for every type to provide such a method.
2. Each type must provide an equality method. Because we aren't in control of raw types, this check will need to exist on the ConcreteType.
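Here is a minimal sketch of what the 2nd approach could look like, assuming a NumPy-backed vector type; the assert_equal method name and tolerance arguments are illustrative, not metagraph's actual API:

import numpy as np

class NumpyVectorType:
    @classmethod
    def assert_equal(cls, obj1, obj2, rel_tol=1e-9, abs_tol=0.0):
        # equality lives on the ConcreteType because we don't control the raw value classes
        assert obj1.shape == obj2.shape, f"shape mismatch: {obj1.shape} != {obj2.shape}"
        assert np.allclose(obj1, obj2, rtol=rel_tol, atol=abs_tol)

# usage in a unit test
NumpyVectorType.assert_equal(np.array([1.0, 2.0]), np.array([1.0, 2.0 + 1e-12]))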
When designing the API, @jim22k and I ran into problems where different backends might support the same algorithm with a different set of parameters.
For example, igraph's betweenness centrality takes an explicit set of nodes to perform betweenness centrality on, as doing so on the whole graph might be prohibitively expensive. NetworkX's betweenness centrality takes a number k of nodes to randomly sample. cugraph's betweenness centrality has the same interface (but unfortunately raises an exception when k is passed in, as it does not yet support randomly sampling k nodes since cudf does not, which in practice means that cugraph only supports betweenness centrality on the whole graph).
The current approach we're going with for designing the API is a "rigid" approach where we choose an interface for each algorithm that's somewhat fixed, easy to use, and is friendly to several backends. For the betweenness centrality example above, this approach works since we could easily choose a more general interface for our API, which would be one similar to that of igraph.
This might be sufficient for now for the problems we've come across so far. Two concerns associated with using the rigid approach going forward include:
One proposal for solving this is to use what I'm going to call the "flexible" approach.
The flexible approach would have it such that each function has a set of parameters where a few are required (e.g. you must specify a graph) and many are optional where some of the optional parameters might be mutually exclusive.
For example, the betweenness centrality interface might accept a number k of nodes to sample or a set of nodes to use, but not both.
Having this set of optional parameters makes our API more easily extendable to support new backends with different interfaces.
For the betweenness centrality case, someone could specify k (which NetworkX supports), a set of nodes, or neither (since cugraph's lack of support for the k parameter effectively removes k from its interface).
That is the basic idea. The parts below discuss the downsides.
The rigid approach would only require the metagraph resolver to figure out how to translate a fixed set of parameters to the same fixed set of parameters of possibly different types. The flexible approach would require the metagraph resolver to figure out (A) which backend interfaces are applicable first and then (B) the type translations. Figuring out (A) might be non-trivial. For example, if k for betweenness centrality is specified, we could "translate" it to igraph's backend by sampling k nodes ourselves. This process might get more complicated as we proceed. It additionally means that determining the fastest translation route would be more challenging.
The flexible approach almost amounts to making our interface for an algorithm F the union of all the backends' interfaces for F. This will lead to a great many parameters, which will make our API very confusing and hard to comprehend or use.
One thing to keep in mind is that our goal isn't necessarily to support every backend or every possible interface to an algorithm. The goal is to support a reasonable subset of portable algorithms. Thus, the flexible approach might be unmotivated.
The flexible approach would exclude some backend choices if we happen to choose an incompatible set of input parameters. That seems undesirable.
An alternative to the flexible approach is to allow the pinning of a backend to the algorithm's function call, e.g. we can specify to use igraph's backend if we definitely want igraph to perform the algorithm. This would allow us to specify backend specific parameters that the caller knows the backend supports. This might solve the problem.
This is like a NodeMap where the value is a Vector. Just as a NodeMap can be converted to a Vector, a NodeEmbedding can be converted to a Matrix.
I'm unsure about the name. NodesEmbedding? NodeEmbeddings? NodeMapToVectors? NodeMapOfVectors? NodeToVectors?
In the current DaskResolver, we sometimes construct the task for calling an algorithm with the ConcreteAlgorithm as the first argument, and sometimes we use ph_apply(func, args, kwargs) (https://github.com/metagraph-dev/metagraph/blob/main/metagraph/core/plugin.py#L618-L628) as the first argument, effectively "currying" the keyword arguments since Dask doesn't have a direct way to represent them in the task.
The ambiguity between these two scenarios complicates #166, which needs to walk the graph and find compilable tasks. If we expand ph_apply() to a full-blown callable class (MetagraphTask) and always use it, even when only positional args are needed, we can streamline some of the compilation logic and have a place to stash metadata attributes about the task that Dask otherwise has nowhere to put.
In fact, I think this would eliminate the need for #180, as we could stash the resolver reference in the MetagraphTask and pass it down to the ConcreteAlgorithm when needed for compilation or other uses.
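For discussion, here is a minimal sketch of what the MetagraphTask callable class could look like (the name comes from the proposal above; the fields and behavior are my assumptions, not an existing implementation):

class MetagraphTask:
    """A single entry in the Dask task graph wrapping a metagraph call."""

    def __init__(self, func, kwargs=None, resolver=None):
        self.func = func              # the ConcreteAlgorithm (or translator) to call
        self.kwargs = kwargs or {}    # "curried" keyword arguments Dask can't represent directly
        self.resolver = resolver      # stashed so compilation doesn't need it passed explicitly
        self.metadata = {}            # extra attributes Dask otherwise has nowhere to put

    def __call__(self, *args):
        return self.func(*args, **self.kwargs)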
@jim22k: Thoughts? If this seems reasonable, I can do it as part of #166.
With Pandas 1.1.1, pytest is reporting the following:
anaconda/envs/mg_q3/lib/python3.8/site-packages/metagraph/tests/types/test_dataframe.py::test_pandas
/Users/sseibert/anaconda/envs/mg_q3/lib/python3.8/site-packages/metagraph/plugins/pandas/types.py:30: FutureWarning: The 'check_less_precise' keyword in testing.assert_*_equal is deprecated and will be removed in a future version. You can stop passing 'check_less_precise' to silence this warning.
pd.testing.assert_frame_equal(
I think a recent commit now assumes scikit-learn is present, causing metagraph-numba testing to fail if it isn't available to import.
Is it a problem that we are hiding both bool and object from builtins here?
https://github.com/ContinuumIO/metagraph/blob/master/metagraph/core/dtypes.py#L3
https://github.com/ContinuumIO/metagraph/blob/master/metagraph/core/dtypes.py#L20
Here's a go at describing how to unify the ideas in both #2 and #3. My assumptions are that we need to specify:
- Abstract types (Graph, DenseArray, etc) and Concrete types (NumPyArray, etc)
- The relationship between concrete types (NumPyArray) and actual data objects / classes (numpy.ndarray)
- Extra attributes (like weight_name) needed to allow the external object to be fully convertible between other concrete types.
There are still a bunch of aesthetic/functional choices here, so here are my proposed choices with some explanation:
Type classes are distinct from data objects (similar to how the typing module declares List to describe list). IncidenceMatrix(transposed=False) asks for a particular property, whereas all properties not listed (as in IncidenceMatrix or IncidenceMatrix()) are implicitly "any".
For now, let's use the actual type classes in plugins, rather than strings. If this creates some circular import issues, we can optionally allow strings to be used as substitutes, similar to the convention with other Python type signatures.
Python inheritance is not used across the abstract and concrete type classes divide. Inheritance may be used between abstract types classes where needed.
Python inheritance is not used between concrete type classes and the data objects they describe. This is because those objects may be defined in codebases we cannot modify.
Python type signatures are used to describe both abstract and concrete algorithms for metagraph.
Types in those signatures which are instances of the metagraph abstract or concrete types will participate in the automatic translation and dispatch mechanism. Types which are basic Python types (like int, str, or tuple) represent "scalars" which will be passed by value without conversion.
Let's use Graph and WeightedGraph to see how this works.
class GraphType(AbstractType):
'''A graph is a collection of nodes and edges that connect nodes.'''
pass # nothing more to specify
class WeightedGraphType(GraphType):
'''A graph that specifies a numeric weight value for each edge'''
# a weighted graph can be converted to a graph, but a graph
# cannot be converted to a weighted graph
pass
Note that abstract types are basically just a class and a docstring, with inheritance showing how things might be related to each other.
An instance of networkx.DiGraph meets our requirement for Graph but not WeightedGraph. For a weighted graph, we will need to define an extra wrapper class to carry the attribute name of the weight.
import networkx as nx

class NetworkXWeightedGraph:
    def __init__(self, graph, weight_label):
        self.graph = graph
        self.weight_label = weight_label
        assert isinstance(graph, nx.DiGraph)
        # verify the edges actually carry the named weight attribute
        assert all(
            weight_label in data for _, _, data in graph.edges(data=True)
        ), f"Graph is missing specified weight label: {weight_label}"
Now we need some types to describe the NetworkX instances above. Let's assume our base ConcreteType looks like this:
class ConcreteType:
    abstract = None  # must override this
    allowed_props = frozendict()  # default is no props
    target = 'cpu'  # key may be used in future to guide dispatch

    def __init__(self, abstract, **props):
        self.abstract = abstract
        for key in props:
            if key not in self.allowed_props:
                raise KeyError(f'{key} not allowed property of {self.__class__}')
            # maybe type check?
        self.props = dict(props)  # copying to be paranoid

    def is_satisfied_by(self, other_type):
        # check if other_type is at least as specific as this one
        if not isinstance(other_type, self.__class__):
            return False
        for k in self.props:
            if self.props[k] != other_type.props[k]:
                return False
        return True

    def __eq__(self, other_type):
        return isinstance(other_type, self.__class__) and \
            self.props == other_type.props

    def isinstance(self, obj):
        # Return True if obj is an object described by this Concrete Type
        raise NotImplementedError()

    def get_props(self, obj):
        # Return a dict of properties that this object satisfies
        raise NotImplementedError()
The type for the NetworkX graph is:
class NetworkXGraphType(ConcreteType):
abstract = GraphType
value_class = nx.DiGraph
allowed_props = frozendict({
# placeholders for now
'foo': bool,
'bar': int,
})
    def isinstance(self, obj):
        # a default implementation exists, so this doesn't have to be overridden unless you
        # need to programmatically identify values of this type
        pass
And the weighted graph type is:
class NetworkXWeightedGraphType(ConcreteType):
    abstract = WeightedGraphType
    value_class = NetworkXWeightedGraph
    allowed_props = frozendict({
        # placeholders for now
        'baz': str,
    })
A translator is a function that takes a value of one concrete type and maps it to a value of another concrete type (optionally with the desired type properties asserted). A translator might look like this:
@metagraph.translator
def nx_to_cugraph(src: NetworkXGraphType, **props) -> CuGraphType:
    # insert implementation here
    ...
For simplicity of dispatch, a translator must be able to handle all properties of both the source and destination concrete type. The decorator is used to add any additional methods or attributes to the functions that the system will find useful. Note that the decorator does not record this function in any global registry (see below).
Note that if a concrete type has properties, it is necessary to define a "self-translator", which is used to translate the value into one with the required properties:
@metagraph.translator
def nx_to_nx(src: NetworkXGraphType, **props) -> NetworkXGraphType:
    # insert implementation here
    ...
The @metagraph.translator decorator turns the function into a callable object with additional properties:
- src_type: ConcreteType class of source
- dst_type: ConcreteType class of destination
Abstract Algorithms are just Python functions without implementations that have a type signature that includes Abstract Types. For example, the Louvain community detection might look like this:
from typing import List

@metagraph.abstract_algorithm(name='community.louvain')
def louvain(graph: GraphType) -> List[GraphType]:
    '''Return the louvain subgraphs'''
    pass
As with the translators, the decorator is used to add useful methods and attributes to the function, as we will see below.
Concrete algorithms look like the abstract algorithm, but use concrete types:
@metagraph.concrete_algorithm('community.louvain')
def nx_louvain(graph: NetworkXGraphType) -> List[NetworkXGraphType]:
    # insert implementation here
    ...
Note that this decorator does not record the nx_louvain method in a registry hidden inside of the abstract louvain algorithm. Instead it converts the function into a callable class with attributes like:
- nx_louvain.abstract_algorithm: Reference back to the louvain object.
- nx_louvain.check_args(*args, **kwargs): Check if argument list matches function signature.
If we want to define a concrete algorithm that only accepts values with a particular property (allowed properties are enumerated in the concrete type), we can do that this way:
@metagraph.concrete_algorithm('community.louvain')
def special_nx_louvain(graph: NetworkXGraphType(foo=True, bar=4)) -> List[NetworkXGraphType(foo=True)]:
    # insert implementation here
    ...
This requires the input graph to have both the property foo=True and bar=4, and asserts that the return value has property foo=True, but nothing else.
For both testing purposes and the creation of special contexts, we will want to encapsulate the state associated with the registry of types, translators, and algorithms. We call this state a Resolver, and it is responsible for:
There will be an implicit, global Resolver created by metagraph when imported, populated by all of the plugins in the environment. Empty resolvers can also be created and populated manually.
The Resolver class will have methods like this:
class Resolver:
def register(
self,
*,
abstract_types: Optional[List[AbstractType]] = None,
concrete_types: Optional[List[ConcreteType]] = None,
translators: Optional[List[Translator]] = None,
abstract_algorithms: Optional[List[AbstractAlgorithm]] = None,
concrete_algorithms: Optional[List[ConcreteAlgorithm]] = None,
):
pass
def load_plugins(self):
'''Populate registries with plugins from environment'''
pass
def typeof(self, obj):
'''Returns fully specified concrete type of obj'''
pass
def convert_to(self, src_obj, dst_type, **props):
'''Converts src_obj to instance of dst_type with given properties'''
pass
def match_algo(self, abstract_algo, arg_types, kwarg_types):
'''Returns concrete algorithm that matches the given abstract
algorithm and args/kwargs'''
pass
As a convenience, the resolver can also dynamically generate the algorithm namespace below it. Ex:
res = Resolver()
res.load_plugins()
# dispatch and call immediately
mygroups = res.algo.community.louvain(mygraph)
# pick the concrete algo and return it
louvain_func = res.algo.community.louvain.match(mygraph)
Both concrete and abstract types should have a __repr__ that returns something like MyType(prop1=True, prop2='bar'). This would make debugging much easier.
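A minimal sketch, assuming the type instance keeps its properties in a props dict as in the ConcreteType sketch earlier:

    def __repr__(self):
        parts = ", ".join(f"{k}={v!r}" for k, v in sorted(self.props.items()))
        return f"{self.__class__.__name__}({parts})"  # e.g. MyType(prop1=True, prop2='bar')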
Also move CompileError to exceptions.py
Union types are needed in certain cases. For example:
BuildGraph(Union[EdgeSet, EdgeMap], Union[NodeSet, NodeMap, None]) -> Graph
NodeSelect(Union[NodeMap, NodeTable], NodeSet) -> Union[NodeMap, NodeTable]
DropNodes(Union[NodeMap, NodeSet], NodeSet) -> Union[NodeMap, NodeSet]
The common thread for these methods is that a variety of different types can be used for the inputs, but we never want to lose the information being passed in. Some of these can be thought of similar to Generics in other languages.
This is in contrast to something like:
TriangleCount(EdgeSet) -> Graph
For TriangleCount, we could pass in an EdgeMap or a Graph and auto-translate into an EdgeSet because all we need for the algorithm is the data contained in the EdgeSet. Everything else is ignored. For this case, we would not use a Union.
Key Takeaway:
Use a Union when multiple types are allowable, but metagraph should not auto-translate between the types. If you want metagraph to auto-translate between types, simply indicate the final type in the signature.
Additional Note about Optional type:
Optional[int] becomes a Union behind the scenes in the Python typing system. It becomes Union[int, NoneType]. Metagraph should never attempt to translate between a type and NoneType, so this still fits the takeaway of never translating between types for Union types.
1. For a plain AbstractType, metagraph will translate within and across types as allowed by unambiguous_subcomponents.
2. For Union[type, NoneType], we assume the user wanted this to be optional, so metagraph follows Rule 1 and additionally allows a None to be passed in.
3. For Union[type1, type2, ...], we allow any concrete type matching one of those types. Metagraph will translate within a type, but never across types.
4. For Union[type1, type2, NoneType], we follow Rule 3 and additionally allow a None to be passed in.
#215 uses an old version of dask.
Ideally, we'd use the latest version of dask.
Before we dig deeply into the base types, let's define what a node is.
networkx lets a node be any hashable object. Each node must be unique within a graph. Let's call this Hashable.
graphblas requires nodes to be sequential uints starting from 0. Let's call this Sequential.
These seem to be the most common definitions of nodes used by various libraries. Sometimes there will be a concept of a node label. igraph uses Sequential nodes, but lets you add a "label" attribute to each node, which needs to be a string. Most methods and algorithms which expect a node let you pass in either the uint or the string and it figures things out.
I see two approaches to handling the difference in node expectations as we convert between different library representations of Graph and Nodes.
1. Require all nodes to be Sequential. When this is not a natural constraint in a library (like networkx), we create a copy and assign each node a sequential number in init. We still keep the mapping around for reference, but that mapping is not passed around.
2. Give every node a label. For graphs with Sequential nodes, the uint number is the label. Maintain a mapping between node label and node position and pass this mapping around. This allows each backend library to use either the label or position as appropriate for their implementation.
Note: the second approach is our current approach with node_index
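A minimal sketch of the second approach, using an illustrative NodeLabels helper (not an existing metagraph class) that keeps the label-to-position mapping with O(1) lookup in both directions:

class NodeLabels:
    def __init__(self, labels):
        self.pos_to_label = list(labels)  # position -> label
        self.label_to_pos = {lbl: i for i, lbl in enumerate(self.pos_to_label)}  # label -> position

    def position(self, label):
        return self.label_to_pos[label]

    def label(self, position):
        return self.pos_to_label[position]

labels = NodeLabels(["NYC", "SFO", "ORD"])
assert labels.position("SFO") == 1 and labels.label(2) == "ORD"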
There is a branch that builds Dask tasks for algorithms in one of two ways, depending on whether the return value is a registered concrete type, or a python type:
https://github.com/metagraph-dev/metagraph/blob/main/metagraph/core/dask/resolver.py#L215-L230
This doesn't seem to be necessary and will make visualization and analysis harder. The result should always be a Placeholder object.
Type | Description | Characteristics (guidance) | Abstract Properties | Required Wrapper Methods/Properties |
---|---|---|---|---|
node_id | uint; not really a type, more of a concept | | | |
Enum or Categorical? | 1:1 mapping from str to uint; can be used to label nodes using strings | O(1) lookup in either direction | | |
Vector | 1-D values | single dtype; no missing values allowed; O(1) lookup by position; iteration order guaranteed | dtype | |
SparseVector | Vector with missing values | single dtype; missing values allowed; O(1) lookup by position; O(1) test for emptiness by position | dtype | |
Matrix | 2-D values | single dtype; no missing values allowed; O(1) lookup by position | dtype, is_square | |
SparseMatrix | Matrix with missing values | single dtype; missing values allowed; O(1) lookup by position; O(1) test for emptiness by position | dtype, is_square | |
DataFrame | 2-D table; each column has a unique string label; each column has a single dtype | | | |
NodeSet | a set of nodes | iteration order not guaranteed; O(1) indication of node inclusion | | num_nodes, __contains__ |
NodeMap | a node:value mapping | iteration order not guaranteed; O(1) lookup by node; single dtype; no missing values | dtype | num_nodes, __getitem__, __contains__ |
NodeTable | represents a set of nodes, with multiple values for each node; think of it as a DataFrame with a row index of nodes and a column for each property | each property has a single dtype; each property has a unique string label; values are allowed to be empty | | num_nodes, num_properties |
Graph | a set of nodes plus an edge:value mapping | iteration order not guaranteed; O(1) lookup for node inclusion; O(1) lookup for edge value; O(1) lookup for edge presence | dtype, is_directed, is_weighted, has_negative_weights | num_nodes, num_edges |
BipartiteGraph | two distinct sets of nodes with edges allowed only between the sets; edges have values | each node set is logically distinct, but not necessarily numerically distinct (node 1 in set A is different than node 1 in set B); single dtype for edge values; edges are undirected | dtype, is_weighted | num_A_nodes, num_B_nodes, num_edges |
PropertyGraph | multiple graphs with the same nodes, each graph represents a unique property; each property has a unique set of edge weights | each graph has a single dtype; each graph has a string label associated with the property | is_directed | num_nodes, num_properties |
Which types require a Wrapper? Probably all the Nodes and Graph variants
Should we formalize the hierarchy of abstract properties?
A Wrapper auto-creates a ConcreteType pointing to itself. In the process, it moves some methods from the Wrapper to the ConcreteType. This is a very magical process which I want to make more transparent to the user.
Encode the hard-coded list of methods, classmethods, and staticmethods in Wrapper.__init_subclass__.
Require all "auto-move" methods to be decorated. The decorator will flag the methods for move, eliminating the need for a hard-coded list of methods.
An additional benefit is that any method can be moved, allowing for more flexibility of helper methods in the ConcreteType.
It also makes the "auto-move" behavior more explicit to the user.
One question I haven't settled on is what to name the decorator:
class MyNewGraph(GraphWrapper, abstract=Graph):
# Option 1
@Wrapper.type_method
@classmethod
def _compute_abstract_properties(cls, ...):
pass
# Option 2
@ConcreteType.method
@classmethod
def _compute_abstract_properties(cls, ...):
pass
# Option 3: ConcreteType has a `method`, `classmethod`, and `staticmethod` decorator
@ConcreteType.classmethod
def _compute_abstract_properties(cls, ...):
pass
I think I like Option 3, although it might throw off syntax highlighting in certain IDEs because we're not using proper classmethod and staticmethod decorators. Instead, we add those decorators at "move" time.
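A minimal sketch of the flagging mechanism behind any of these spellings (the flag attribute name is an assumption): the decorator only marks the function, and Wrapper.__init_subclass__ would scan for flagged methods instead of relying on a hard-coded list:

def type_method(func):
    """Flag a method defined on a Wrapper for relocation onto its auto-created ConcreteType."""
    func._move_to_concrete_type = True
    return func

def find_methods_to_move(cls):
    # what __init_subclass__ would do at class-creation time
    return {
        name: attr
        for name, attr in vars(cls).items()
        if getattr(attr, "_move_to_concrete_type", False)
    }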
The Metagraph conda recipe is not including the *.html files required to make the explorer work.
It should also add websockets as a dependency, as it is required for the explorer.
For example,
mg.translate(
ss.csr_matrix([[1, 0], [0, 0]]),
mg.resolver.types.SparseMatrix.GrblasMatrixType
)
complains about not finding numpy.longlong. There may be other datatypes that we should be able to handle, but don't.
The Dask API for Metagraph needs to:
Unlike other dask-aware data types, we are not looking to make Graphs, etc partitioned, as we are deferring any choices about parallel data layouts to the backend which likely needs to communicate in a lower overhead, more tightly coupled way (like MPI). However, we also will want to enable translators (and maybe some algorithms) that take in Dask partitioned data structures (like dask.dataframe and dask.array) as inputs and do the conversion in a parallel fashion if possible. Example here would be taking a Dask dataframe edge list loaded from many worker nodes and turning it into a Graph in a backend that has some kind of distributed shared memory. We don't need to address this use case immediately, but we should keep it in mind.
This is a rough sketch of how this could look to start discussion:
import dask.dataframe as dd
import metagraph.dask as dg # for "dask graph"
passenger_flow_df = dd.read_csv("airline.csv")
# do some data prep here
passenger_flow_edge_map = dg.EdgeMap.PandasEdgeMap(passenger_flow_df,
'ORIGIN_CITY_MARKET_ID',
'DEST_CITY_MARKET_ID',
'INVERSE_PASSENGER_COUNT',
is_directed=True)
passenger_flow_graph = dg.util.graph.build(passenger_flow_edge_map)
betweenness_centrality = dg.centrality.betweenness(passenger_flow_graph, normalize=False)
number_of_best_scores = 15
best_betweenness_centrality_node_vector = dg.util.nodemap.sort(betweenness_centrality, ascending=False, limit=number_of_best_scores)
best_betweenness_centrality_node_set = dg.util.nodeset.from_vector(best_betweenness_centrality_node_vector)
best_betweenness_centrality_node_to_score_map = dg.util.nodemap.select(betweenness_centrality, best_betweenness_centrality_node_set)
best_betweenness_centrality_node_to_score_map
best_betweenness_centrality_node_to_score_map = dg.translate(best_betweenness_centrality_node_to_score_map, dg.NodeMap.PythonNodeMap)
print(best_betweenness_centrality_node_to_score_map.compute().value)
Ideally, plugin authors would write no extra code to enable the above. If a translator or algorithm returns a Dask task graph rather than a bare object, the system should not attempt to wrap the DAG in another Dask object, but just pass it through unchanged. This will allow concrete algorithms and translators that take Dask inputs to return Dask outputs that respect the partitioning of the input.
I encountered this strange phenomenon:
>>> import metagraph as mg
>>> type(mg.types.EdgeMap(is_directed=True))
<class 'metagraph.types.EdgeMap'>
>>> mg.types.EdgeMap
<class 'metagraph.types.EdgeMap'>
>>> issubclass(mg.types.EdgeMap(is_directed=True), mg.types.EdgeMap)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: issubclass() arg 1 must be a class
>>>
>>> import inspect
>>> inspect.isclass(mg.types.EdgeMap(is_directed=True))
False
This is confusing.
If mg.types.EdgeMap(is_directed=True)
is a class, then ideally we wouldn't get the errors above.
If mg.types.EdgeMap(is_directed=True)
is not a class, then ideally type(mg.types.EdgeMap(is_directed=True))
would print something less misleading.
When using metagraph locally in a dev environment, I need to do this
import metagraph as mg
from metagraph import default_plugins
mg.resolver.register(**default_plugins.find_plugins())
because we don't have entry points yet.
It would be nice if this could be detected so that the default plugins are registered automatically.
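A hedged sketch of the desired fallback; whether the resolver exposes something to test for "no plugins registered" (the abstract_types attribute below) is an assumption:

import metagraph as mg
from metagraph import default_plugins

# if entry points yielded nothing, fall back to the built-in plugins
if not getattr(mg.resolver, "abstract_types", None):
    mg.resolver.register(**default_plugins.find_plugins())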
I know I keep spinning on this, but I think I finally have a solution.
Separating ConcreteTypes and Wrapper classes initially appears to make sense because we need to specialize each ConcreteType with properties when writing algorithm signatures, but we don't have the underlying data yet.
I think there is a different way to still get easy parameterization within the function signatures while allowing ConcreteTypes and Wrapper classes to merge into the same class. They essentially already have a 1:1 relationship, so I feel like they should not be separated.
As an example, let's look at the Graph abstract type. nx.DiGraph can fully satisfy this abstract class, so we can create:
class NxGraph(ConcreteType):
    abstract = Graph
    value_class = nx.DiGraph
While this seems ideal, it still comes with the drawback that concrete algorithms are spec'd as:
def NxAddRandomEdges(g: NxGraph) -> NxGraph:
    # assume g is an instance of nx.DiGraph, not NxGraph -- potential source of confusion
    # add random edges where they don't currently exist
    return new_g  # of type nx.DiGraph
Your IDE will complain that new_g is of type nx.DiGraph, while the signature wants NxGraph.
When you use the library, you simply write:
g = nx.DiGraph()
add_starting_edges(g)
g2 = mg.algo.AddRandomEdges(g)
This is ideal in the world where the raw object fully satisfies the abstract type and there are no properties. Unfortunately, things get more complicated.
Now take the case where the raw object doesn't fully satisfy things: scipy.sparse.csr_matrix
class ScipyAdjacencyMatrix:
    def __init__(self, mat, transposed=False):
        self.obj = mat
        self.transposed = transposed

class ScipyAdjacencyMatrixType(ConcreteType):
    abstract = Graph
    value_class = ScipyAdjacencyMatrix
And now concrete algorithms look like this:
def ScipyAddRandomEdges(g: ScipyAdjacencyMatrixType) -> ScipyAdjacencyMatrixType:
    # assume g is of type ScipyAdjacencyMatrix -- confusing
    # add random edges where they don't currently exist
    return ScipyAdjacencyMatrix(mat_with_new_edges)
Your IDE will again complain about the type mismatch in the signature
Assume adjacency vs incidence were a property
class ScipyMatrix:
    def __init__(self, mat, kind='adjacency', transposed=False):
        self.obj = mat
        self.kind = kind
        self.transposed = transposed

class ScipyMatrixType(ConcreteType):
    abstract = Graph
    allowed_props = {'kind'}
    value_class = ScipyMatrix
Now things can get more interesting with algorithms:
def ScipyAddRandomEdges(g: ScipyMatrixType(kind='adjacency')) -> ScipyMatrixType(kind='adjacency'):
    # assume g is of type ScipyMatrix with kind=adjacency
    # add random edges where they don't currently exist
    return ScipyMatrix(mat_with_new_edges, kind='adjacency')
Here we see the utility of a separate ScipyMatrixType which can be called with kind='adjacency' completely separately from the actual ScipyMatrix class.
When you use the library, you write:
mat = scipy.sparse.csr_matrix((data, (rows, cols)))
g = ScipyMatrix(mat, kind='adjacency') # Have to know about this specific class
g2 = mg.algo.AddRandomEdges(g)
- A .with() method on ConcreteType builds and returns a specialized subclass.
- Calling .with() using the same arguments will return the same object, i.e. it only creates parameterized subclasses when seeing a new set of property values for the first time.
- Unfortunately, with isn't a valid method name. Bummer. Maybe go with .using() or .make() or .set() or .require()?
- Vector(some_dict) can't disambiguate which concrete type to instantiate. Properties would be needed, with a single PythonSparseVector class that takes type=1 or type=2 as a property.
Let's work through all the same examples as above.
nx.DiGraph can fully satisfy the Graph abstract class.
class NxGraph(ConcreteType):
    abstract = Graph
    value_class = nx.DiGraph

    def __init__(self, g: nx.DiGraph):
        self.obj = g
Now we have a concrete algorithm:
def NxAddRandomEdges(g: NxGraph) -> NxGraph:
    # assume g is an instance of NxGraph, must retrieve graph using .obj
    # add random edges where they don't currently exist
    return NxGraph(updated_nx_graph)
Your IDE no longer complains about the signature.
When you use the library, you simply write:
g = nx.DiGraph()
add_starting_edges(g)
g2 = mg.algo.AddRandomEdges(Graph(g))
This involves wrapping nx.DiGraph in a Graph, but the user doesn't need to know about NxGraph. They can use the abstract type and it will figure it out based on registered concrete types.
Now take the case where the raw object doesn't fully satisfy things: scipy.sparse.csr_matrix
class ScipyAdjacencyMatrix(ConcreteType):
    abstract = Graph
    value_class = scipy.sparse.spmatrix

    def __init__(self, mat: scipy.sparse.spmatrix, transposed=False):
        self.obj = mat
        self.transposed = transposed
No need to define two classes. And now the concrete algorithm looks like this:
def ScipyAddRandomEdges(g: ScipyAdjacencyMatrix) -> ScipyAdjacencyMatrix:
    # assume g is of type ScipyAdjacencyMatrix
    # add random edges where they don't currently exist
    return ScipyAdjacencyMatrix(mat_with_new_edges)
Type signature matches, so the IDE is happy
Assume adjacency vs incidence were a property
class ScipyMatrix(ConcreteType):
    abstract = Graph
    allowed_props = {'kind'}
    value_class = scipy.sparse.spmatrix

    def __init__(self, mat: scipy.sparse.spmatrix, kind='adjacency', transposed=False):
        self.obj = mat
        self.kind = kind
        self.transposed = transposed
Now things can get more interesting with algorithms:
def ScipyAddRandomEdges(g: ScipyMatrix.with(kind='adjacency')) -> ScipyMatrix.with(kind='adjacency'):
    # assume g is of type ScipyMatrix with kind=adjacency
    # add random edges where they don't currently exist
    return ScipyMatrix(mat_with_new_edges, kind='adjacency')
The .with(kind='adjacency') call actually creates a new parameterized subclass. Because it is still a subclass of ConcreteType, the resolver can figure out everything it needs about dispatching.
And now for the final trick, let's see how a user might call this.
mat = scipy.sparse.csr_matrix((data, (rows, cols)))
g = Graph(mat, kind='adjacency') # Just use the abstract type and pass properties
g2 = mg.algo.AddRandomEdges(g)
In addition to properties affecting dispatch, they are also required if an abstract type cannot figure out the appropriate concrete type from a raw object. Essentially, we view going from abstract to concrete as a type of dispatch, which properties affect.
As an example, look at how scipy.sparse.spmatrix is used as a Graph. It could be an adjacency matrix or an incidence matrix. Because you can't figure it out solely by inspecting the matrix, there should be a property for the user to indicate that.
Creating two separate concrete types (e.g. ScipyAdjacencyMatrix and ScipyIncidenceMatrix) won't work anymore because they share the same raw object, but neither has any properties to help disambiguate one class from the other. This technically would still be allowed, but only if users begin treating ScipyAdjacencyMatrix as a raw object and wrap a concrete type around it. There may be corner cases where that makes sense, but usually the raw objects will be well-known objects from other libraries that users are already familiar with.
ScipyEdgeSet and ScipyEdgeMap currently have a _node_list attribute that acts as a map between the label of a node and the index of the corresponding row/column in the SciPy matrix.
There's currently no established way of accessing this without getting our hands dirty by using a private attribute.
Perhaps we should have some way of getting a NodeLabel out of a ScipyEdgeSet or ScipyEdgeMap? One idea would be to make it required to have a public "node_label" property or accessor that'll return a NodeLabel.
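For example, a minimal sketch of the accessor idea (the property name node_labels and the constructor shown are illustrative assumptions, not the current classes):

class ScipyEdgeMap:
    def __init__(self, matrix, node_list):
        self.value = matrix
        self._node_list = node_list

    @property
    def node_labels(self):
        """Public accessor for the node-label <-> matrix-index mapping."""
        return self._node_list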
It looks like test_hits_centrality:
metagraph/metagraph/tests/algorithms/test_centrality.py
Lines 230 to 255 in d828260
is now failing. I get:
FAILED metagraph/tests/algorithms/test_centrality.py::test_hits_centrality - AssertionError: Mismatch for node 0: 1.0693502568464412e-135 != -1.375893824035368e-19
This is with NetworkX 2.6.2 on macOS x86-64. The concrete algorithm is here:
metagraph/metagraph/plugins/networkx/algorithms.py
Lines 201 to 206 in 7f316d1
Buried inside metagraph is a generic system that allows for pluggable abstract types, translators, and dispatch (with optional automatic translation). There's nothing really graph specific about this system, so it could be spun out into its own package, with metagraph being condensed to just the abstract types and algorithms for graph analytics.
I think it makes sense to wait on this until metagraph is more mature and we more fully explore the translation and dispatch feature space. Once that core is more mature, a spin out could make sense if there is interest. (Feel free to watch this issue to be notified when this happens.)
I noticed that we copy ConcreteAlgorithms when registered in the resolver:
https://github.com/metagraph-dev/metagraph/blob/main/metagraph/core/resolver.py#L572-L573
This means we could, at time of registration, set an attribute on the ConcreteAlgorithm pointing to the resolver without causing any issue, even if for some reason two resolvers have the same algorithm registered. Having that reference would simplify the logic here:
https://github.com/metagraph-dev/metagraph/blob/main/metagraph/core/plugin.py#L618-L628
As it would no longer require the resolver to be passed as an explicit argument. This is the trick I used in #166, but I would like to generalize it, if it makes sense.
@jim22k: Thoughts?
While beginning to write real concrete algorithms, I came up with a few questions.
What does everyone think about
@abstract_algorithm('link_analysis.pagerank')
def pagerank(...):
pass
@concrete_algorithm(pagerank)
def nx_pagerank(...):
return ...
Currently, the decorator must be @concrete_algorithm('link_analysis.pagerank'), which is nice when the concrete algorithm and abstract are in separate modules. But sometimes they're right next to each other. Both forms are readable. I suggest we support both styles.
Where do default values belong?
@abstract_algorithm('link_analysis.pagerank')
def pagerank(g: Graph, damping: float = 0.85) -> DenseVector:
pass
@concrete_algorithm('link_analysis.pagerank')
def nx_pagerank(g: nx.DiGraph, damping: float = 0.75) -> NumpyVector:
return ... # ^^^^ allow this?
Defining the defaults in the abstract definition seems best so that switching paths doesn't accidentally change the call -- for example, if nx_pagerank had a different damping than cugraph_pagerank. I wouldn't want the damping factor to change just because the solver decided to use the GPU and the user didn't specify what damping factor to use.
Given the resolver solves paths through the landscape of types, translators, and algorithms, it would be nice to visualize that landscape for users. This would show both the number of types as well as completeness of translator pathways.
Brainstorming with @paul-tqh-nguyen, we decided to show:
We also want to have solvers for translation paths and algorithm choice.
Paul will focus on front-end development. The translator page will likely use d3.js. Other pages will be more of filterable tree widgets. The whole thing will likely run as a single-page application with dynamic calls via websockets and JSON data being passed back and forth.
Jim will focus on the back-end development. The structure will likely be a simple websocket design using async calls. It won't be as performant as a real webserver, but it should only have a single client hitting it, so it will suffice. The goal will be to write a temporary webpage and open a browser tab pointing to the temp file. Once the page is closed, the websocket connection will disconnect and the server process will terminate. This will give enough dynamic nature to avoid duplicating logic in javascript without the complexity of a full framework like flask or django.
With abstract properties, we require that increasing the specificity of properties be done via an algorithm. For example, going from undirected=False to undirected=True, there might be an algorithm called treat_edges_as_undirected(Graph(undirected=False)) -> Graph(undirected=True). This allows the user to be explicit when information is dropped.
However, an algorithm drop_weights could work for both a Graph and a Mapping, treating all weights as a value of 1. Currently, there is no way in metagraph to define an algorithm with a polymorphic signature.
A better approach in this case is to have a drop_weights method on the objects, i.e. my_graph.drop_weights() or my_mapping.drop_weights(). This only works for algorithms which take and return the same concrete type.
All concrete types would need to implement the method; otherwise, the method becomes unavailable. We definitely don't want to call my_scipy_adj_matrix.drop_weights() and have metagraph convert to networkx, drop the weights, and convert back to a scipy adjacency matrix. I can't imagine a case where that isn't surprising to a user.
The actual implementation could be as an algorithm in the normal sense with the method being a special handling by metagraph. Or, we could have users write it as an actual method on their concrete type. I'm more inclined to go with the latter.
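A minimal sketch of the "actual method on the concrete type" option, using an illustrative SciPy-backed graph wrapper (the class and attribute names are assumptions):

import scipy.sparse as ss

class ScipyGraph:
    def __init__(self, matrix):
        self.value = matrix

    def drop_weights(self):
        """Return the same concrete type with every edge weight treated as 1."""
        unweighted = self.value.copy()
        unweighted.data[:] = 1
        return ScipyGraph(unweighted)

g = ScipyGraph(ss.csr_matrix([[0, 2.5], [0.5, 0]]))
assert (g.drop_weights().value.data == 1).all()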
@jim22k, I sometimes get this failure when running tests on master:
_________________________________________________________ test_python _________________________________________________________
default_plugin_resolver = <metagraph.core.resolver.Resolver object at 0xa1c9e6550>
def test_python(default_plugin_resolver):
dpr = default_plugin_resolver
sparse_dict = {1: 1.1, 5: 5.5, 2: 2.2}
x = PythonSparseVector(sparse_dict, size=30)
assert len(x) == 30
# Convert back and forth from numpy sparse array
nps = dpr.translate(x, NumpySparseVector)
assert isinstance(nps, NumpySparseVector)
> y = dpr.translate(nps, PythonSparseVector)
metagraph/tests/translators/test_sparsevector.py:15:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
metagraph/core/resolver.py:338: in translate
translator = self.find_translator(value, dst_type)
metagraph/core/resolver.py:325: in find_translator
ret_val = self.find_translation_path(src_type, dst_type)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <metagraph.core.resolver.Resolver object at 0xa1c9e6550>
src_type = <class 'metagraph.default_plugins.wrappers.numpy.NumpyVectorType'>
dst_type = <class 'metagraph.default_plugins.wrappers.python.PythonSparseVectorType'>
def find_translation_path(self, src_type, dst_type) -> Optional[Translator]:
import scipy.sparse as ss
abstract = dst_type.abstract
if abstract not in self.translation_matrices:
# Build translation matrix
concrete_types = [
ct for ct in self.concrete_types if issubclass(ct.abstract, abstract)
]
m = ss.dok_matrix((len(concrete_types), len(concrete_types)), dtype=bool)
for s, d in self.translators:
try:
sidx = concrete_types.index(s)
didx = concrete_types.index(d)
m[sidx, didx] = True
except ValueError:
pass
sssp, predecessors = ss.csgraph.dijkstra(
m.tocsr(), return_predecessors=True, unweighted=True
)
self.translation_matrices[abstract] = (concrete_types, sssp, predecessors)
# Lookup shortest path from stored results
concrete_types, sssp, predecessors = self.translation_matrices[abstract]
> sidx = concrete_types.index(src_type)
E ValueError: <class 'metagraph.default_plugins.wrappers.numpy.NumpyVectorType'> is not in list
Something changed in importlib_metadata=3.7.3 (previously 2.0.0) which breaks test_load_duplicate_name. Looking at sys.path, both site_dir and bad_site_dir are present. But when grabbing entry points, only bad_site_dir is represented (along with core metagraph).
Rather than digging in: python=3.8 introduced importlib.metadata, and we should use that instead. Until we stop supporting python=3.7 in metagraph, let's just pin to importlib_metadata=2.0.0 to avoid the bug.
Since Placeholder objects represent their final value with a single key (like Delayed), it makes sense for them to have a key property, similar to Delayed. See:
https://github.com/dask/dask/blob/master/dask/delayed.py#L525-L527
This will make writing tests more convenient.
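A minimal sketch of the idea, mirroring how Delayed exposes its key (the internal attribute name here is an assumption about Placeholder):

class Placeholder:
    def __init__(self, key, dsk):
        self._key = key
        self._dsk = dsk

    @property
    def key(self):
        """The single task-graph key whose result this placeholder represents."""
        return self._key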
MultiVerify is sufficiently long and complex that it could use its own unit tests:
https://github.com/ContinuumIO/metagraph/blob/master/metagraph/tests/algorithms/__init__.py#L11-L166
Our current usage of MultiVerify only hits 58% of the lines, so many corner cases of the utility are not even exercised.
Properties still don't feel fully fleshed out in the codebase. Here are my thoughts on the direction we should go regarding properties.
Some properties affect algorithms computationally, not just in how the data is stored and accessed. For example, triangle counting only works for undirected graphs.
Such properties should be defined on the abstract type, not on concrete types. This would allow abstract algorithms to indicate required properties which all implementations must satisfy.
Axioms of properties on abstract types
In the tests, we have a StrType with a lowercase property. A string with lowercase=True is still valid if the property is changed to lowercase=False because it is already in a narrower, more constrained state. Going from the more general lowercase=False to True requires changing the data.
This pattern shows up in our real types:
A property like is_transposed on an AdjacencyMatrix is easily reversible. Dataframe labels are also reversible. If the current src_label='SOURCE', it's easy to request src_label='SRC' and then reverse the process back to 'SOURCE'.
These properties do not affect the computation. They will affect the specific algorithm, but changing the representation of the data never loses information.
Any property which doesn't fit the requirements of an abstract property must be a concrete property, defined separately for each concrete type.
Axioms of properties on concrete types
We already have the concept of subclassed abstract types, like WeightedGraph vs Graph. This was done because we recognized that weight can go one way, but is not reversible. Abstract properties generalize this finding, so we may not need abstract type subclassing.
With abstract properties, I think we only need a few abstract types now (listed along with their properties):
Translators can easily deal with concrete properties because they are reversible. Self-translators only deal with concrete properties.
Abstract properties are more nuanced when it comes to translators. In general, translators should avoid changing abstract properties when converting from one concrete type to another. If that is not possible because the destination type does not support the source's level of specificity, the translator should generalize that abstract property to the maximum allowed by the destination. Doing so will not lose any information. However, the reverse should never be done. A translator may not increase the specificity of an abstract property. Such an action requires an algorithm or possibly a constructor.
Rules for translators
There are some unused parameters in the compiler APIs to pass literal values (i.e. values that should be frozen at compile time and can be used to influence code generation) to the compiler. These eventually need to be wired up after #173, which will make the distinction between compile-time literal values (which won't be wrapped in a placeholder class) and runtime values (which will be wrapped in placeholder classes) more clear.
(Eventually we will need some additional logic to allow things like a trained graph embedding to be treated as a literal so it can be fused with another ML algorithm.)
Concrete types without a value_type, like NumpyMatrixType and NumpyVectorType, are breaking the explorer code.
metagraph/explorer/api.py:170
AttributeError: 'NoneType' object has no attribute '__module__'
MultiVerify (defined in the testing framework for algorithms), has two methods:
The first handles translation from different return types and does an exact comparison. The challenge is that many algorithms are not deterministic, which is a requirement for exact comparison of outputs.
The second is fully flexible, but puts the burden on the user to compare everything.
With the addition of utility functions, many nondeterministic results can be made deterministic. For example, in max flow we know which nodes are the bottlenecks, and by doing an edge aggregation followed by a node selection, we get a deterministic result to compare.
I propose adding enhancements to MultiVerify to allow calls something like this:
mv = MultiVerify(resolver)
results = mv.compute("flow.max_flow", graph) # returns (flow_rate, graph_with_actual_flow_on_edges)
# Compare flow rate
results[0].assert_equal(6)
# Normalize actual flow to prepare to transform
actual_flow = results[1].normalize("NetworkXGraphType")
# Compare sum of out edges for bottleneck nodes
out_edges = mv.transform(core_networkx, "util.graph.aggregate_edges", actual_flow, lambda x, y: x + y, initial_value=0)
out_bottleneck = mv.transform(core_networkx, "util.nodemap.select", out_edges, PythonNodeSet({2, 4}))
out_bottleneck.assert_equal(PythonNodeMap({2: 6, 4: 6}))
# Compare sum of in edges for bottleneck nodes
in_edges = mv.transform(core_networkx, "util.graph.aggregate_edges", actual_flow, lambda x, y: x + y, initial_value=1, in_edges=True, out_edges=False)
in_bottleneck = mv.transform(core_networkx, "util.nodemap.select", in_edges, PythonNodeSet({2, 4}))
in_bottleneck.assert_equal(PythonNodeMap({2: 6, 4: 6}))
@eriknw found this bug
Six tests are failing after upgrade to scipy==1.6.1. Some translation is converting ints to np.float64 rather than np.int64.
If we define some algorithms like this:
@abstract_algorithm('testing.scale')
def testing_scale(a: Vector, scale: float)->Vector: #pragma: no cover
pass
@concrete_algorithm('testing.scale', compiler="identity")
def compiled_scale(a: NumpyVectorType, scale: float)->NumpyVectorType:
return a * scale
And then construct a value like this:
a = np.arange(100)
scale_func = res.algos.testing.scale
z = scale_func(scale_func(scale_func(a, 2.0), 3.0), 4.0)
The resulting dask graph is this:
{('call-8b12ab9c7500e5a04d2200176747e90e', 'testing.scale'): (<metagraph.core.plugin.ConcreteAlgorithm object at 0x7ff6080720d0>,
('call-8f1882ff20c40a1140f666ab9450cf58',
'testing.scale'),
3.0),
('call-8f1882ff20c40a1140f666ab9450cf58', 'testing.scale'): (<metagraph.core.plugin.ConcreteAlgorithm object at 0x7ff6080720d0>,
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]),
2.0),
('call-a0001325df28ad22d68449a7ebc15273', 'testing.scale'): (<metagraph.core.plugin.ConcreteAlgorithm object at 0x7ff6080720d0>,
('call-8b12ab9c7500e5a04d2200176747e90e',
'testing.scale'),
4.0)}
In particular, notice this task:
('call-8f1882ff20c40a1140f666ab9450cf58', 'testing.scale'): (<metagraph.core.plugin.ConcreteAlgorithm object at 0x7ff6080720d0>,
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]),
2.0),
has included the NumPy array (NumpyVectorType) in the arguments for the task.
For various reasons, I think we want to actually wrap that array in a placeholder object as a separate task in the graph and then pass that key to the concrete algorithm. In particular, we will eventually want the ability to rewrite the graph to swap out one concrete algo with another, which might necessitate a translation step. It is cleaner if the translatable data (i.e., those values represented by concrete types) have been separated out into their own entries in the task graph. Also, by wrapping the value in a placeholder, we can retain the type metadata about the value, which will also help in some graph rewrite transformations.
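To illustrate the desired shape of the graph, here is a hypothetical sketch (the wrapper function, type tagging, and keys are all illustrative, not the real metagraph API) where the raw array becomes its own keyed task and downstream calls refer to it by key:

import numpy as np
import dask

def wrap_value(value, type_name):
    # stand-in for creating a placeholder that records the concrete type metadata
    return {"value": value, "type": type_name}

def scale(wrapped, factor):
    return wrap_value(wrapped["value"] * factor, wrapped["type"])

a = np.arange(100)
dsk = {
    "wrap-0": (wrap_value, a, "NumpyVectorType"),
    ("call-1", "testing.scale"): (scale, "wrap-0", 2.0),
    ("call-2", "testing.scale"): (scale, ("call-1", "testing.scale"), 3.0),
}
result = dask.get(dsk, ("call-2", "testing.scale"))  # value is a * 6.0, type metadata preserved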
This should not be allowed:
from metagraph import abstract_algorithm, PluginRegistry
from metagraph.plugins.core.types import Vector
from metagraph.plugins.numpy.types import NumpyVectorType
@abstract_algorithm('testing.scale')
def testing_scale(a: Vector, scale: float)->NumpyVectorType: # oops, bad return type
pass
registry = PluginRegistry('test_registry')
registry.register(testing_scale) # this should raise an error
Some type conversions and algorithms are obvious:
Some of these are already proposed as required wrapper methods.
Other useful algorithms require a callable:
These all require a callable of some sort to be passed. It might be as simple as max, min, or sum, but it could also be a user-defined function.
Aggregation (or reduction) methods generally take a binary function: lambda x, y: x + y. There are some cases where all values are needed simultaneously -- for example, to calculate the median.
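A small, library-agnostic illustration of that distinction (plain Python/NumPy, nothing metagraph-specific):
import functools
import numpy as np

values = [3, 1, 4, 1, 5]

# A binary aggregator can be folded over the values two at a time.
total = functools.reduce(lambda x, y: x + y, values)  # 14

# The median cannot be expressed as a binary fold; it needs every value at once.
median = np.median(values)  # 3.0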
Initial Proposal:
Filters are more complicated than aggregators. Filters technically require a unary operator, but in practice are often binary operators with a scalar (>0, ==max_val, etc.). Another common filter practice is to accept a boolean array as input.
Initial Proposal:
- compare method between two objects; accept common binary functions or user-defined binary functions. Could return a NodeSet instead of a boolean NodeMap.
- select method which takes a NodeSet as input and only keeps those nodes.

Some of these changes require metagraph to modify inputs prior to dispatching. Jitting functions is an example of this.
Having a layer which allows the User API to differ from the concrete implementation API should be used with caution, but seems necessary and useful in several algorithms.
Currently the translation matrix cache in Resolver is a dictionary of tuples:
https://github.com/ContinuumIO/metagraph/blob/master/metagraph/core/planning.py#L93-L94
This could easily be made into a dataclass, possibly with the construction code as a method:
https://github.com/ContinuumIO/metagraph/blob/master/metagraph/core/planning.py#L93-L94
This would also make it possible to write unit tests just to test the translation matrix.
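A minimal sketch of what that dataclass might look like (the field and method names here are assumptions for illustration, not what planning.py actually stores):
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

@dataclass
class TranslationMatrix:
    """Hypothetical container for the translation table of one abstract type."""
    concrete_types: list = field(default_factory=list)
    # (source concrete type, destination concrete type) -> translator callable
    translators: Dict[Tuple[type, type], Callable] = field(default_factory=dict)

    @classmethod
    def build(cls, translators):
        """Construction logic that currently lives in the Resolver could move here."""
        matrix = cls()
        for (src, dst), func in translators.items():
            matrix.translators[(src, dst)] = func
            for ct in (src, dst):
                if ct not in matrix.concrete_types:
                    matrix.concrete_types.append(ct)
        return matrix
With the construction code on the class, a unit test can build a TranslationMatrix from a handful of fake translators and assert on its contents without standing up a full Resolver.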
The default visualize() method in Dask uses circles for function nodes, which creates several problems with our current visualization of Metagraph DAGs (described below). We could fix this with our own visualize() method, and we could also add in some special features along the way.
The problem with circles for function nodes is that it severely limits the length of the label unless you want function nodes which take up a disproportionately large area in the drawing of the DAG. To work around this limitation, we've forced our longer labels into the rectangular data nodes, which results in a confusing diagram. Data nodes contain labels describing translations (like NumpyNodeMapType->PythonNodeMapType) or algorithm names (like util.nodemap.apply), when those labels should be on the function nodes that precede the data node.
Ideally, data nodes should be labeled with the concrete data type whenever possible. Additionally, we can add some special kwargs to our custom visualize() to do things like:
We can do this customization by constructing and passing function_attributes and data_attributes dictionaries to the underlying dask visualize() method, overriding the default node styles, colors, and labels.
We will also need to provide this visualize() method as a standalone function for use in larger DAGs where the final node is not a Metagraph placeholder object, but something else that will have the standard visualize() method.
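A sketch of the plumbing (the label choices are placeholders; the real logic would pull the algorithm name or translator description from the task and the concrete type from the placeholder metadata):
from dask.dot import dot_graph

def visualize(dsk, filename="metagraph", **kwargs):
    """Hypothetical custom visualize() for a metagraph task graph (a dict)."""
    function_attributes = {}
    data_attributes = {}
    for key in dsk:
        # Keys in the graphs above look like ('call-<hash>', '<algorithm name>');
        # put the algorithm/translator label on a rectangular function node...
        if isinstance(key, tuple) and len(key) == 2:
            function_attributes[key] = {"shape": "box", "label": str(key[1])}
        # ...and label the data node separately (ideally with the concrete type).
        data_attributes[key] = {"label": "(data)"}
    return dot_graph(dsk, filename=filename,
                     function_attributes=function_attributes,
                     data_attributes=data_attributes, **kwargs)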
In discussion of #28, we've encountered a problem that seems to be fundamentally about tracking metadata on objects defined outside of metagraph. Many of the properties that we may want to dispatch on are both:
Recomputing some properties every time they are needed is a non-starter, and attempting to cache properties on the object itself would likely surprise the user by mutating their objects (even if in invisible ways), potentially create strange bugs, or be impossible if the object forbids adding attributes.
Workarounds include:
In the case of Graph, I am leaning more toward Metagraph needing to provide a wrapper, if only to try to create some uniformity between different Graph objects. More so than vectors and matrices, graphs can be defined in dramatically different ways between libraries, and some amount of manual intervention from the user is needed to describe how to interpret the graph.
However, for other objects, and for general usage, another approach is to have a property cache inside the Resolver object (which is the place where all of these properties are likely to be needed). This cache can track properties of externally-defined objects as they pass through the system, avoiding the need to do expensive recomputation or to mutate the object.
The main problem to solve with an object property cache is "what to use as a hash?". Not all objects are hashable, and some may be somewhat expensive to hash. The cheapest "hash" function to use is id(obj), which in CPython returns the memory address of the PyObject struct. The danger of persisting id() in a data structure, like a cache, is that the id() can be reused during a program if the object is deleted and a new object is created which happens to get the same location in memory. In order to avoid this problem, we can use a weakref to track if the object has been deleted since we last checked the property.
Here's a quick proof of concept of what this could look like:
import weakref

class Dummy:
    pass

class PropertyCache:
    def __init__(self):
        self._prop_cache = {}
        self._ref_check = weakref.WeakValueDictionary()

    def __getitem__(self, obj):
        objid = id(obj)
        if objid in self._prop_cache and \
           objid in self._ref_check:  # verify weakref still valid
            return self._prop_cache[objid]
        else:
            props = {}
            self._prop_cache[objid] = props
            self._ref_check[objid] = obj  # weak reference
            return props
prop_cache = PropertyCache()
obj = Dummy()
first_id = id(obj)
prop_cache[obj]['flavor'] = 'vanilla'
assert 'flavor' in prop_cache[obj]
assert first_id in prop_cache._ref_check # object should be present
del obj
assert first_id not in prop_cache._ref_check # object should be removed
A better implementation of this property cache should use the callback mechanism of weakrefs to also remove entries in self._prop_cache when objects are deleted, and should also make __getitem__() thread-safe.
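A sketch of that improved version (still a proof of concept; the locking is deliberately coarse):
import threading
import weakref

class PropertyCache:
    def __init__(self):
        self._prop_cache = {}
        self._refs = {}
        self._lock = threading.RLock()

    def __getitem__(self, obj):
        objid = id(obj)
        with self._lock:
            if objid in self._prop_cache:
                return self._prop_cache[objid]
            props = {}
            self._prop_cache[objid] = props
            # The callback removes both entries when the object is garbage collected,
            # so a recycled id() can never hand back stale properties.
            self._refs[objid] = weakref.ref(obj, lambda _, objid=objid: self._evict(objid))
            return props

    def _evict(self, objid):
        with self._lock:
            self._prop_cache.pop(objid, None)
            self._refs.pop(objid, None)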
I think by both having a Graph wrapper and using the above property cache to track additional attributes of third-party objects, we can resolve the concerns with properties in metagraph.
When writing tests, it is common to use this pattern to create a registry with some plugins for testing:
@abstract_algorithm('testing.scale')
def testing_scale(a: Vector, scale: float) -> Vector:  # pragma: no cover
    pass

@concrete_algorithm('testing.scale', compiler="identity")
def compiled_scale(a: NumpyVectorType, scale: float) -> NumpyVectorType:
    return a * scale
registry = PluginRegistry("test_subgraphs_plugin")
registry.register(testing_scale)
registry.register(compiled_scale)
I find myself often forgetting to actually register the plugins at the end. While there are good reasons for this format in actual plugins (which will likely use registry.register_from_modules()), I wish I had an alternate syntax like:
with PluginRegistry("test_subgraphs_plugin") as registry:
    @abstract_algorithm('testing.scale')
    def testing_scale(a: Vector, scale: float) -> Vector:  # pragma: no cover
        pass

    @concrete_algorithm('testing.scale', compiler="identity")
    def compiled_scale(a: NumpyVectorType, scale: float) -> NumpyVectorType:
        return a * scale
that would auto-register all the plugin-like things defined within the context manager block. Unfortunately, I think implementing this requires too much magic to behave as expected, like snapshotting locals() before and after the block to discover everything defined within it (which fails if a variable is redefined).
Since this is only for testing, perhaps a variant of register_from_modules(), like a register_from_locals() which discovers everything in the local namespace of the fixture function (and is not recursive), would be sufficient.
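Roughly (register_from_locals() does not exist yet; AbstractAlgorithm is assumed to live next to ConcreteAlgorithm, which appears in the task graphs above):
import inspect

from metagraph.core.plugin import AbstractAlgorithm, ConcreteAlgorithm  # assumed location

def register_from_locals(registry):
    """Hypothetical test helper: register every plugin-like object defined in the
    caller's local namespace. Non-recursive by design."""
    caller_locals = inspect.currentframe().f_back.f_locals
    for obj in caller_locals.values():
        if isinstance(obj, (AbstractAlgorithm, ConcreteAlgorithm)):
            registry.register(obj)
A test fixture would then define its algorithms as above and finish with a single register_from_locals(registry) call instead of one register() call per object.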
The data shown by display() on these two classes is useful. Should we make that the output of __str__() for easier access? display() can then just be print(str(self)).
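If we do, the refactor on each class is small; something like (the class name here is a stand-in):
class ExampleWrapper:  # stand-in for the two classes in question
    def __str__(self):
        # the formatting that display() currently builds would move here
        return "...formatted summary..."

    def display(self):
        # display() then becomes the thin wrapper suggested above
        print(str(self))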
Some embedding algorithms take a list of Graphs as input. When running under Dask, these remain Placeholders when handed to the plugin code.