metagraph-dev / metagraph
Multi-target API for graph analytics with Dask
Home Page: https://metagraph.readthedocs.io/en/latest/
License: Apache License 2.0
Thinking about the plugin architecture is making me want a versioned approach to the API, similar to how things are done for a REST API. With REST APIs, novice implementers write something like /item/1/update/, but experienced authors write /v1/item/1/update/ because they know that their API will evolve over time and they want a way to move forward without breaking the existing users of their REST API.
In a similar vein, we will choose types, properties, and abstract algorithms that plugins will implement. If we ever want to change those types, properties, and algorithm signatures or paths, we don't want to break all existing plugins. Let's think through the potential approaches to take.
If we decide that our types are wrong, it should be easy enough to create new types. We will lose the ability to reuse a name (i.e. Vector) to mean something different, but that's probably okay. The likelihood is that we will introduce new types (BipartiteGraph, MultiGraph, NeuralNetwork) rather than wanting to change an existing type.
I have been thinking that I want to eliminate the hierarchy of abstract properties and instead make properties dependent on each other. For example, a Graph is_weighted or not. If is_weighted=True, then you can have a strictly_positive property. Similarly, for a Matrix the is_symmetric property only applies to a Matrix that is_square. There is a hierarchy of properties depending on each other.
For a given property, there still is clearly a direction of narrowness or specificity, as I've discussed before. Going from narrow to general is easy. However, there is often a standard way to go in the other direction as well. Weights are simply dropped, directed edges ignore their direction, etc. As such, we could ignore the directionality of properties.
However, having a concept of general vs narrow settings for a property is nice when it comes to adding a new property to existing implementations. For example, if Graphs today have only properties is_directed and is_weighted, and we later add a property has_self_edges, all existing implementations of Graphs become ill-defined. They don't know about the new property and therefore cannot tell you if they have self-edges or not. If we maintain the concept of general vs narrow settings, then an undefined property is always set to its most general setting. This means that introducing a new property will keep all existing implementations fully defined, but restricted to the most general setting.
For the example of self-loops, having self-loops is the general case while not having self-loops is the narrow case. So we can assume all existing Graph types have self-loops, and slowly upgrade plugins with a knowledge of self-loop determination without breaking the whole ecosystem.
This ability to add new properties in a clean manner seems reason enough to keep the general vs narrow concept of property settings.
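To make the idea concrete, here is a minimal sketch (the names and structure are illustrative, not metagraph's actual property machinery) of how an undefined property could default to its most general setting:

import pprint

GRAPH_PROPERTIES = {
    "is_directed":    {"values": [True, False], "general": True},
    "is_weighted":    {"values": [True, False], "general": True},
    "has_self_edges": {"values": [True, False], "general": True},  # newly added property
}

def effective_properties(declared):
    """Fill in any property the implementation didn't declare with its most general setting."""
    return {name: declared.get(name, spec["general"]) for name, spec in GRAPH_PROPERTIES.items()}

# An older plugin written before has_self_edges existed stays fully defined:
pprint.pprint(effective_properties({"is_directed": False, "is_weighted": True}))
# {'has_self_edges': True, 'is_directed': False, 'is_weighted': True}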
Changing an abstract algorithm's signature is dangerous because it will no longer match existing concrete algorithm signatures. We could change the name of the algorithm, but we could end up in a case with sssp, sssp2, shortest_path, and ss_shortest_path, making it very confusing for a user to know which is the latest and greatest. Plenty of existing libraries already suffer from this condition. They deprecate functions, but cannot remove them for fear of breaking existing code.
We likely need some type of versioning with abstract algorithms tagged against v1, v2, etc. We will want to keep older versions around to avoid breaking existing user code and to avoid eliminating existing plugin implementations. Then we slowly deprecate older versions or find a way to steer new user code to use the newer versions.
I'm not sure how to do this without making all user code look like r.algo.v1.link_analysis.pagerank(G) and r.algo.v2.rankings.google_pagerank(G). Anyone have good ideas?
In a similar vein, I like the categorization of abstract algorithms, but those categories might shift over time as the number of algorithms grows. In the ideal world, a user would be able to search for an algorithm in a notebook using a hierarchy or tags or other methods, and then call the algorithm independent of the mechanism used to find the algorithm name.
Something like r.algo.link_analysis.<tab> would show that pagerank is a valid algorithm, but the user would actually write code like r.algo.pagerank to keep working code independent of the categories we choose.
Using a version approach (r.algo1, r.algo2) could handle this, as each new version could change the categorization without breaking the previous version's style. I don't love that solution, but it would work. Any other ideas?
When writing unit tests, we will often have a need to check for equality or near-equality (within a tolerance) of two objects.
What is the best approach to achieving this? I'm leaning towards the 2nd approach.
1. Each test knows what the expected result will be and can perform whatever checks it wants. If some types have an equality method, the test can call that, but there is no expectation for every type to provide such a method.
2. Each type must provide an equality method. Because we aren't in control of raw types, this check will need to exist on the ConcreteType.
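Here is a minimal sketch of what the 2nd approach could look like, assuming a NumPy-backed vector type; the assert_equal method name and tolerance arguments are illustrative, not metagraph's actual API:

import numpy as np

class NumpyVectorType:
    @classmethod
    def assert_equal(cls, obj1, obj2, rel_tol=1e-9, abs_tol=0.0):
        # equality lives on the ConcreteType because we don't control the raw value classes
        assert obj1.shape == obj2.shape, f"shape mismatch: {obj1.shape} != {obj2.shape}"
        assert np.allclose(obj1, obj2, rtol=rel_tol, atol=abs_tol)

# usage in a unit test
NumpyVectorType.assert_equal(np.array([1.0, 2.0]), np.array([1.0, 2.0 + 1e-12]))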
When designing the API, @jim22k and I ran into problems where different backends might support the same algorithm with a different set of parameters.
For example, igraph's betweenness centrality takes an explicit set of nodes to perform betweenness centrality on, as doing so on the whole graph might be prohibitively expensive. NetworkX's betweenness centrality takes a number k of nodes to randomly sample. cugraph's betweenness centrality has the same interface (but unfortunately raises an exception when k is passed in, as it does not yet support randomly sampling k nodes since cudf does not, which in practice means that cugraph only supports betweenness centrality on the whole graph).
The current approach we're going with for designing the API is a "rigid" approach where we choose an interface for each algorithm that's somewhat fixed, easy to use, and is friendly to several backends. For the betweenness centrality example above, this approach works since we could easily choose a more general interface for our API, which would be one similar to that of igraph.
This might be sufficient for now for the problems we've come across so far. Two concerns associated with using the rigid approach going forward include:
One proposal for solving this is to use what I'm going to call the "flexible" approach.
The flexible approach would have it such that each function has a set of parameters where a few are required (e.g. you must specify a graph) and many are optional where some of the optional parameters might be mutually exclusive.
For example, the betweenness centrality interface might accept a number k of nodes to sample or a set of nodes to use, but not both.
Having this set of optional parameters makes our API more easily extendable to support new backends with different interfaces.
For the betweenness centrality case, someone could specify k (which NetworkX supports), a set of nodes, or neither (since cugraph's lack of support for the k parameter effectively removes k from its interface).
That is the basic idea. The parts below discuss the downsides.
The rigid approach would only require the metagraph resolver to figure out how to translate a fixed set of parameters to the same fixed set of parameters of possibly different types. The flexible approach would require the metagraph resolver to figure out (A) which backend interfaces are applicable first and then (B) the type translations. Figuring out (A) might be non-trivial. For example, if k for betweenness centrality is specified, we could "translate" it to igraph's backend by sampling k nodes ourselves. This process might get more complicated as we proceed. It additionally means that determining the fastest translation route would be more challenging.
The flexible approach almost amounts to making our interface for an algorithm F the union of all the backends' interfaces for F. This will lead to a great many parameters, which will make our API very confusing and hard to comprehend or use.
One thing to keep in mind is that our goal isn't necessarily to support every backend or every possible interface to an algorithm. The goal is to support a reasonable subset of portable algorithms. Thus, the flexible approach might be unmotivated.
The flexible approach would exclude some backend choices if we happen to choose an incompatible set of input parameters. That seems undesirable.
An alternative to the flexible approach is to allow the pinning of a backend to the algorithm's function call, e.g. we can specify to use igraph's backend if we definitely want igraph to perform the algorithm. This would allow us to specify backend specific parameters that the caller knows the backend supports. This might solve the problem.
This is like a NodeMap where the value is a Vector. Just as a NodeMap can be converted to a Vector, a NodeEmbedding can be converted to a Matrix.
I'm unsure about the name. NodesEmbedding? NodeEmbeddings? NodeMapToVectors? NodeMapOfVectors? NodeToVectors?
In the current DaskResolver, we sometimes construct the task for calling an algorithm with the ConcreteAlgorithm as the first argument, and sometimes we use ph_apply(func, args, kwargs) (https://github.com/metagraph-dev/metagraph/blob/main/metagraph/core/plugin.py#L618-L628) as the first argument, effectively "currying" the keyword arguments since Dask doesn't have a direct way to represent them in the task.
The ambiguity between these two scenarios complicates #166, which needs to walk the graph and find compilable tasks. If we expand ph_apply() to a full-blown callable class (MetagraphTask) and always use it, even when only positional args are needed, we can streamline some of the compilation logic and have a place to stash metadata attributes about the task that Dask otherwise has nowhere to put.
In fact, I think this would eliminate the need for #180, as we could stash the resolver reference in the MetagraphTask and pass it down to the ConcreteAlgorithm when needed for compilation or other uses.
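For discussion, here is a minimal sketch of what the MetagraphTask callable class could look like (the name comes from the proposal above; the fields and behavior are my assumptions, not an existing implementation):

class MetagraphTask:
    """A single entry in the Dask task graph wrapping a metagraph call."""

    def __init__(self, func, kwargs=None, resolver=None):
        self.func = func              # the ConcreteAlgorithm (or translator) to call
        self.kwargs = kwargs or {}    # "curried" keyword arguments Dask can't represent directly
        self.resolver = resolver      # stashed so compilation doesn't need it passed explicitly
        self.metadata = {}            # extra attributes Dask otherwise has nowhere to put

    def __call__(self, *args):
        return self.func(*args, **self.kwargs)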
@jim22k: Thoughts? If this seems reasonable, I can do it as part of #166.
With Pandas 1.1.1, pytest is reporting the following:
anaconda/envs/mg_q3/lib/python3.8/site-packages/metagraph/tests/types/test_dataframe.py::test_pandas
/Users/sseibert/anaconda/envs/mg_q3/lib/python3.8/site-packages/metagraph/plugins/pandas/types.py:30: FutureWarning: The 'check_less_precise' keyword in testing.assert_*_equal is deprecated and will be removed in a future version. You can stop passing 'check_less_precise' to silence this warning.
pd.testing.assert_frame_equal(
I think a recent commit now assumes scikit-learn is present, causing metagraph-numba testing to fail if it isn't available to import.
Is it a problem that we are hiding both bool and object from builtins here?
https://github.com/ContinuumIO/metagraph/blob/master/metagraph/core/dtypes.py#L3
https://github.com/ContinuumIO/metagraph/blob/master/metagraph/core/dtypes.py#L20
Here's a go at describing how to unify the ideas in both #2 and #3. My assumptions are that we need to specify:
- Abstract types (Graph, DenseArray, etc) and Concrete types (NumPyArray, etc)
- The relationship between concrete types (NumPyArray) and actual data objects / classes (numpy.ndarray)
- Extra attributes (like weight_name) needed to allow the external object to be fully convertible between other concrete types.
There are still a bunch of aesthetic/functional choices here, so here are my proposed choices with some explanation:
Type classes are distinct from data objects (similar to how the typing module declares List to describe list). IncidenceMatrix(transposed=False) asks for a particular property, whereas all properties not listed (as in IncidenceMatrix or IncidenceMatrix()) are implicitly "any".
For now, let's use the actual type classes in plugins, rather than strings. If this creates some circular import issues, we can optionally allow strings to be used as substitutes, similar to the convention with other Python type signatures.
Python inheritance is not used across the abstract and concrete type classes divide. Inheritance may be used between abstract types classes where needed.
Python inheritance is not used between concrete type classes and the data objects they describe. This is because those objects may be defined in codebases we cannot modify.
Python type signatures are used to describe both abstract and concrete algorithms for metagraph.
Types in those signatures which are instances of the metagraph abstract or concrete types will participate in the automatic translation and dispatch mechanism. Types which are basic Python types (like int, str, or tuple) represent "scalars" which will be passed by value without conversion.
Let's use Graph and WeightedGraph to see how this works.
class GraphType(AbstractType):
'''A graph is a collection of nodes and edges that connect nodes.'''
pass # nothing more to specify
class WeightedGraphType(GraphType):
'''A graph that specifies a numeric weight value for each edge'''
# a weighted graph can be converted to a graph, but a graph
# cannot be converted to a weighted graph
pass
Note that abstract types are basically just a class and a docstring, with inheritance showing how things might be related to each other.
An instance of networkx.DiGraph meets our requirement for Graph but not WeightedGraph. For a weighted graph, we will need to define an extra wrapper class to carry the attribute name of the weight.
import networkx as nx

class NetworkXWeightedGraph:
    def __init__(self, graph, weight_label):
        self.graph = graph
        self.weight_label = weight_label
        assert isinstance(graph, nx.DiGraph)
        # verify the edges actually carry the named weight attribute
        assert all(
            weight_label in data for _, _, data in graph.edges(data=True)
        ), f"Graph is missing specified weight label: {weight_label}"
Now we need some types to describe the NetworkX instances above. Let's assume our base ConcreteType looks like this:
class ConcreteType:
    abstract = None  # must override this
    allowed_props = frozendict()  # default is no props
    target = 'cpu'  # key may be used in future to guide dispatch

    def __init__(self, abstract, **props):
        self.abstract = abstract
        for key in props:
            if key not in self.allowed_props:
                raise KeyError(f'{key} not allowed property of {self.__class__}')
            # maybe type check?
        self.props = dict(props)  # copying to be paranoid

    def is_satisfied_by(self, other_type):
        # check if other_type is at least as specific as this one
        if not isinstance(other_type, self.__class__):
            return False
        for k in self.props:
            if self.props[k] != other_type.props[k]:
                return False
        return True

    def __eq__(self, other_type):
        return isinstance(other_type, self.__class__) and \
            self.props == other_type.props

    def isinstance(self, obj):
        # Return True if obj is an object described by this Concrete Type
        raise NotImplementedError()

    def get_props(self, obj):
        # Return a dict of properties that this object satisfies
        raise NotImplementedError()
The type for the NetworkX graph is:
class NetworkXGraphType(ConcreteType):
abstract = GraphType
value_class = nx.DiGraph
allowed_props = frozendict({
# placeholders for now
'foo': bool,
'bar': int,
})
    def isinstance(self, obj):
        # a default implementation exists, so this doesn't have to be overridden unless you
        # need to programmatically identify values of this type
        pass
And the weighted graph type is:
class NetworkXWeightedGraphType(ConcreteType):
    abstract = WeightedGraphType
    value_class = NetworkXWeightedGraph
    allowed_props = frozendict({
        # placeholders for now
        'baz': str,
    })
A translator is a function that takes a value of one concrete type and maps it to a value of another concrete type (optionally with the desired type properties asserted). A translator might look like this:
@metagraph.translator
def nx_to_cugraph(src: NetworkXGraphType, **props) -> CuGraphType:
    # insert implementation here
    ...
For simplicity of dispatch, a translator must be able to handle all properties of both the source and destination concrete type. The decorator is used to add any additional methods or attributes to the functions that the system will find useful. Note that the decorator does not record this function in any global registry (see below).
Note that if a concrete type has properties, it is necessary to define a "self-translator", which is used to translate the value into one with the required properties:
@metagraph.translator
def nx_to_nx(src: NetworkXGraphType, **props) -> NetworkXGraphType:
    # insert implementation here
    ...
The @metagraph.translator decorator turns the function into a callable object with additional properties:
- src_type: ConcreteType class of source
- dst_type: ConcreteType class of destination
Abstract Algorithms are just Python functions without implementations that have a type signature that includes Abstract Types. For example, the Louvain community detection might look like this:
from typing import List

@metagraph.abstract_algorithm(name='community.louvain')
def louvain(graph: GraphType) -> List[GraphType]:
    '''Return the louvain subgraphs'''
    pass
As with the translators, the decorator is used to add useful methods and attributes to the function, as we will see below.
Concrete algorithms look like the abstract algorithm, but use concrete types:
@metagraph.concrete_algorithm('community.louvain')
def nx_louvain(graph: NetworkXGraphType) -> List[NetworkXGraphType]:
    # insert implementation here
    ...
Note that this decorator does not record the nx_louvain method in a registry hidden inside of the abstract louvain algorithm. Instead it converts the function into a callable class with attributes like:
- nx_louvain.abstract_algorithm: Reference back to the louvain object.
- nx_louvain.check_args(*args, **kwargs): Check if argument list matches function signature.
If we want to define a concrete algorithm that only accepts values with a particular property (allowed properties are enumerated in the concrete type), we can do that this way:
@metagraph.concrete_algorithm('community.louvain')
def special_nx_louvain(graph: NetworkXGraphType(foo=True, bar=4)) -> List[NetworkXGraphType(foo=True)]:
    # insert implementation here
    ...
This requires the input graph to have both the property foo=True and bar=4, and asserts that the return value has property foo=True, but nothing else.
For both testing purposes and the creation of special contexts, we will want to encapsulate the state associated with the registry of types, translators, and algorithms. We call this state a Resolver, and it is responsible for:
There will be an implicit, global Resolver created by metagraph when imported, populated by all of the plugins in the environment. Empty resolvers can also be created and populated manually.
The Resolver class will have methods like this:
class Resolver:
def register(
self,
*,
abstract_types: Optional[List[AbstractType]] = None,
concrete_types: Optional[List[ConcreteType]] = None,
translators: Optional[List[Translator]] = None,
abstract_algorithms: Optional[List[AbstractAlgorithm]] = None,
concrete_algorithms: Optional[List[ConcreteAlgorithm]] = None,
):
pass
def load_plugins(self):
'''Populate registries with plugins from environment'''
pass
def typeof(self, obj):
'''Returns fully specified concrete type of obj'''
pass
def convert_to(self, src_obj, dst_type, **props):
'''Converts src_obj to instance of dst_type with given properties'''
pass
def match_algo(self, abstract_algo, arg_types, kwarg_types):
'''Returns concrete algorithm that matches the given abstract
algorithm and args/kwargs'''
pass
As a convenience, the resolver can also dynamically generate the algorithm namespace below it. Ex:
res = Resolver()
res.load_plugins()
# dispatch and call immediately
mygroups = res.algo.community.louvain(mygraph)
# pick the concrete algo and return it
louvain_func = res.algo.community.louvain.match(mygraph)
Both concrete and abstract types should have a __repr__ that returns something like MyType(prop1=True, prop2='bar'). This would make debugging much easier.
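A minimal sketch, assuming the type instance keeps its properties in a props dict as in the ConcreteType sketch earlier:

    def __repr__(self):
        parts = ", ".join(f"{k}={v!r}" for k, v in sorted(self.props.items()))
        return f"{self.__class__.__name__}({parts})"  # e.g. MyType(prop1=True, prop2='bar')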
Also move CompileError to exceptions.py
Union types are needed in certain cases. For example:
BuildGraph(Union[EdgeSet, EdgeMap], Union[NodeSet, NodeMap, None]) -> Graph
NodeSelect(Union[NodeMap, NodeTable], NodeSet) -> Union[NodeMap, NodeTable]
DropNodes(Union[NodeMap, NodeSet], NodeSet) -> Union[NodeMap, NodeSet]
The common thread for these methods is that a variety of different types can be used for the inputs, but we never want to lose the information being passed in. Some of these can be thought of similar to Generics in other languages.
This is in contrast to something like:
TriangleCount(EdgeSet) -> Graph
For TriangleCount, we could pass in an EdgeMap or a Graph and auto-translate into an EdgeSet because all we need for the algorithm is the data contained in the EdgeSet. Everything else is ignored. For this case, we would not use a Union.
Key Takeaway:
Use a Union when multiple types are allowable, but metagraph should not auto-translate between the types. If you want metagraph to auto-translate between types, simply indicate the final type in the signature.
Additional Note about Optional type:
Optional[int] becomes a Union behind the scenes in the Python typing system. It becomes Union[int, NoneType]. Metagraph should never attempt to translate between a type and NoneType, so this still fits the takeaway of never translating between types for Union types.
1. For a plain AbstractType, metagraph will translate within and across types as allowed by unambiguous_subcomponents.
2. For Union[type, NoneType], we assume the user wanted this to be optional, so metagraph follows Rule 1 and additionally allows a None to be passed in.
3. For Union[type1, type2, ...], we allow any concrete type matching one of those types. Metagraph will translate within a type, but never across types.
4. For Union[type1, type2, NoneType], we follow Rule 3 and additionally allow a None to be passed in.
#215 uses an old version of dask.
Ideally, we'd use the latest version of dask.
Before we dig deeply into the base types, let's define what a node is.
networkx lets a node be any hashable object. Each node must be unique within a graph. Let's call this Hashable.
graphblas requires nodes to be sequential uints starting from 0. Let's call this Sequential.
These seem to be the most common definitions of nodes used by various libraries. Sometimes there will be a concept of a node label. igraph uses Sequential nodes, but lets you add a "label" attribute to each node, which needs to be a string. Most methods and algorithms which expect a node let you pass in either the uint or the string and it figures things out.
I see two approaches to handling the difference in node expectations as we convert between different library representations of Graph and Nodes.
1. Require all nodes to be Sequential. When this is not a natural constraint in a library (like networkx), we create a copy and assign each node a sequential number in init. We still keep the mapping around for reference, but that mapping is not passed around.
2. Give every node a label. For graphs with Sequential nodes, the uint number is the label. Maintain a mapping between node label and node position and pass this mapping around. This allows each backend library to use either the label or position as appropriate for their implementation.
Note: the second approach is our current approach with node_index
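A minimal sketch of the second approach, using an illustrative NodeLabels helper (not an existing metagraph class) that keeps the label-to-position mapping with O(1) lookup in both directions:

class NodeLabels:
    def __init__(self, labels):
        self.pos_to_label = list(labels)  # position -> label
        self.label_to_pos = {lbl: i for i, lbl in enumerate(self.pos_to_label)}  # label -> position

    def position(self, label):
        return self.label_to_pos[label]

    def label(self, position):
        return self.pos_to_label[position]

labels = NodeLabels(["NYC", "SFO", "ORD"])
assert labels.position("SFO") == 1 and labels.label(2) == "ORD"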
There is a branch that builds Dask tasks for algorithms in one of two ways, depending on whether the return value is a registered concrete type, or a python type:
https://github.com/metagraph-dev/metagraph/blob/main/metagraph/core/dask/resolver.py#L215-L230
This doesn't seem to be necessary and will make visualization and analysis harder. The result should always be a Placeholder object.
Type | Description | Characteristics (guidance) | Abstract Properties | Required Wrapper Methods/Properties |
---|---|---|---|---|
node_id | uint; not really a type, more of a concept | | | |
Enum or Categorical? | 1:1 mapping from str to uint; can be used to label nodes using strings | O(1) lookup in either direction | | |
Vector | 1-D values | single dtype; no missing values allowed; O(1) lookup by position; iteration order guaranteed | dtype | |
SparseVector | Vector with missing values | single dtype; missing values allowed; O(1) lookup by position; O(1) test for emptiness by position | dtype | |
Matrix | 2-D values | single dtype; no missing values allowed; O(1) lookup by position | dtype, is_square | |
SparseMatrix | Matrix with missing values | single dtype; missing values allowed; O(1) lookup by position; O(1) test for emptiness by position | dtype, is_square | |
DataFrame | 2-D table; each column has a unique string label; each column has a single dtype | | | |
NodeSet | a set of nodes | iteration order not guaranteed; O(1) indication of node inclusion | | num_nodes, __contains__ |
NodeMap | a node:value mapping | iteration order not guaranteed; O(1) lookup by node; single dtype; no missing values | dtype | num_nodes, __getitem__, __contains__ |
NodeTable | represents a set of nodes, with multiple values for each node; think of it as a DataFrame with a row index of nodes and a column for each property | each property has a single dtype; each property has a unique string label; values are allowed to be empty | | num_nodes, num_properties |
Graph | a set of nodes plus an edge:value mapping | iteration order not guaranteed; O(1) lookup for node inclusion; O(1) lookup for edge value; O(1) lookup for edge presence | dtype, is_directed, is_weighted, has_negative_weights | num_nodes, num_edges |
BipartiteGraph | two distinct sets of nodes with edges allowed only between the sets; edges have values | each node set is logically distinct, but not necessarily numerically distinct (node 1 in set A is different than node 1 in set B); single dtype for edge values; edges are undirected | dtype, is_weighted | num_A_nodes, num_B_nodes, num_edges |
PropertyGraph | multiple graphs with the same nodes, each graph represents a unique property; each property has a unique set of edge weights | each graph has a single dtype; each graph has a string label associated with the property | is_directed | num_nodes, num_properties |
Which types require a Wrapper? Probably all the Nodes and Graph variants
Should we formalize the hierarchy of abstract properties?
A Wrapper auto-creates a ConcreteType pointing to itself. In the process, it moves some methods from the Wrapper to the ConcreteType. This is a very magical process which I want to make more transparent to the user.
Encode the hard-coded list of methods, classmethods, and staticmethods in Wrapper.__init_subclass__.
Require all "auto-move" methods to be decorated. The decorator will flag the methods for move, eliminating the need for a hard-coded list of methods.
An additional benefit is that any method can be moved, allowing for more flexibility of helper methods in the ConcreteType.
It also makes the "auto-move" behavior more explicit to the user.
One question I haven't settled on is what to name the decorator:
class MyNewGraph(GraphWrapper, abstract=Graph):
# Option 1
@Wrapper.type_method
@classmethod
def _compute_abstract_properties(cls, ...):
pass
# Option 2
@ConcreteType.method
@classmethod
def _compute_abstract_properties(cls, ...):
pass
# Option 3: ConcreteType has a `method`, `classmethod`, and `staticmethod` decorator
@ConcreteType.classmethod
def _compute_abstract_properties(cls, ...):
pass
I think I like Option 3, although it might throw off syntax highlighting in certain IDEs because we're not using proper classmethod and staticmethod decorators. Instead, we add those decorators at "move" time.
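A minimal sketch of the flagging mechanism behind any of these spellings (the flag attribute name is an assumption): the decorator only marks the function, and Wrapper.__init_subclass__ would scan for flagged methods instead of relying on a hard-coded list:

def type_method(func):
    """Flag a method defined on a Wrapper for relocation onto its auto-created ConcreteType."""
    func._move_to_concrete_type = True
    return func

def find_methods_to_move(cls):
    # what __init_subclass__ would do at class-creation time
    return {
        name: attr
        for name, attr in vars(cls).items()
        if getattr(attr, "_move_to_concrete_type", False)
    }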
The Metagraph conda recipe is not including the *.html files required to make the explorer work.
It should also add websockets as a dependency, as it is required for the explorer.
For example,
mg.translate(
ss.csr_matrix([[1, 0], [0, 0]]),
mg.resolver.types.SparseMatrix.GrblasMatrixType
)
complains about not finding numpy.longlong. There may be other datatypes that we should be able to handle, but don't.
The Dask API for Metagraph needs to:
Unlike other dask-aware data types, we are not looking to make Graphs, etc partitioned, as we are deferring any choices about parallel data layouts to the backend which likely needs to communicate in a lower overhead, more tightly coupled way (like MPI). However, we also will want to enable translators (and maybe some algorithms) that take in Dask partitioned data structures (like dask.dataframe and dask.array) as inputs and do the conversion in a parallel fashion if possible. Example here would be taking a Dask dataframe edge list loaded from many worker nodes and turning it into a Graph in a backend that has some kind of distributed shared memory. We don't need to address this use case immediately, but we should keep it in mind.
This is a rough sketch of how this could look to start discussion:
import dask.dataframe as dd
import metagraph.dask as dg # for "dask graph"
passenger_flow_df = dd.read_csv("airline.csv")
# do some data prep here
passenger_flow_edge_map = dg.EdgeMap.PandasEdgeMap(passenger_flow_df,
'ORIGIN_CITY_MARKET_ID',
'DEST_CITY_MARKET_ID',
'INVERSE_PASSENGER_COUNT',
is_directed=True)
passenger_flow_graph = dg.util.graph.build(passenger_flow_edge_map)
betweenness_centrality = dg.centrality.betweenness(passenger_flow_graph, normalize=False)
number_of_best_scores = 15
best_betweenness_centrality_node_vector = dg.util.nodemap.sort(betweenness_centrality, ascending=False, limit=number_of_best_scores)
best_betweenness_centrality_node_set = dg.util.nodeset.from_vector(best_betweenness_centrality_node_vector)
best_betweenness_centrality_node_to_score_map = dg.util.nodemap.select(betweenness_centrality, best_betweenness_centrality_node_set)
best_betweenness_centrality_node_to_score_map
best_betweenness_centrality_node_to_score_map = dg.translate(best_betweenness_centrality_node_to_score_map, dg.NodeMap.PythonNodeMap)
print(best_betweenness_centrality_node_to_score_map.compute().value)
Ideally, plugin authors would write no extra code to enable the above. If a translator or algorithm returns a Dask task graph rather than a bare object, the system should not attempt to wrap the DAG in another Dask object, but just pass it through unchanged. This will allow concrete algorithms and translators that take Dask inputs to return Dask outputs that respect the partitioning of the input.
I encountered this strange phenomenon:
>>> import metagraph as mg
>>> type(mg.types.EdgeMap(is_directed=True))
<class 'metagraph.types.EdgeMap'>
>>> mg.types.EdgeMap
<class 'metagraph.types.EdgeMap'>
>>> issubclass(mg.types.EdgeMap(is_directed=True), mg.types.EdgeMap)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: issubclass() arg 1 must be a class
>>>
>>> import inspect
>>> inspect.isclass(mg.types.EdgeMap(is_directed=True))
False
This is confusing.
If mg.types.EdgeMap(is_directed=True)
is a class, then ideally we wouldn't get the errors above.
If mg.types.EdgeMap(is_directed=True)
is not a class, then ideally type(mg.types.EdgeMap(is_directed=True))
would print something less misleading.
When using metagraph locally in a dev environment, I need to do this
import metagraph as mg
from metagraph import default_plugins
mg.resolver.register(**default_plugins.find_plugins())
because we don't have entry points yet.
It would be nice if this could be detected so that the default plugins are registered automatically.
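A hedged sketch of the desired fallback; whether the resolver exposes something to test for "no plugins registered" (the abstract_types attribute below) is an assumption:

import metagraph as mg
from metagraph import default_plugins

# if entry points yielded nothing, fall back to the built-in plugins
if not getattr(mg.resolver, "abstract_types", None):
    mg.resolver.register(**default_plugins.find_plugins())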
I know I keep spinning on this, but I think I finally have a solution.
Separating ConcreteTypes and Wrapper classes initially appears to make sense because we need to specialize each ConcreteType with properties when writing algorithm signatures, but we don't have the underlying data yet.
I think there is a different way to still get easy parameterization within the function signatures while allowing ConcreteTypes and Wrapper classes to merge into the same class. They essentially already have a 1:1 relationship, so I feel like they should not be separated.
As an example, let's look at the Graph abstract type. nx.DiGraph can fully satisfy this abstract class, so we can create:
class NxGraph(ConcreteType):
    abstract = Graph
    value_class = nx.DiGraph
While this seems ideal, it still comes with the drawback that concrete algorithms are spec'd as:
def NxAddRandomEdges(g: NxGraph) -> NxGraph:
    # assume g is an instance of nx.DiGraph, not NxGraph -- potential source of confusion
    # add random edges where they don't currently exist
    return new_g  # of type nx.DiGraph
Your IDE will complain that new_g is of type nx.DiGraph, while the signature wants NxGraph.
When you use the library, you simply write:
g = nx.DiGraph()
add_starting_edges(g)
g2 = mg.algo.AddRandomEdges(g)
This is ideal in the world where the raw object fully satisfies the abstract type and there are no properties. Unfortunately, things get more complicated.
Now take the case where the raw object doesn't fully satisfy things: scipy.sparse.csr_matrix
class ScipyAdjacencyMatrix:
    def __init__(self, mat, transposed=False):
        self.obj = mat
        self.transposed = transposed

class ScipyAdjacencyMatrixType(ConcreteType):
    abstract = Graph
    value_class = ScipyAdjacencyMatrix
And now concrete algorithms look like this:
def ScipyAddRandomEdges(g: ScipyAdjacencyMatrixType) -> ScipyAdjacencyMatrixType:
    # assume g is of type ScipyAdjacencyMatrix -- confusing
    # add random edges where they don't currently exist
    return ScipyAdjacencyMatrix(mat_with_new_edges)
Your IDE will again complain about the type mismatch in the signature
Assume adjacency vs incidence were a property
class ScipyMatrix:
    def __init__(self, mat, kind='adjacency', transposed=False):
        self.obj = mat
        self.kind = kind
        self.transposed = transposed

class ScipyMatrixType(ConcreteType):
    abstract = Graph
    allowed_props = {'kind'}
    value_class = ScipyMatrix
Now things can get more interesting with algorithms:
def ScipyAddRandomEdges(g: ScipyMatrixType(kind='adjacency')) -> ScipyMatrixType(kind='adjacency'):
    # assume g is of type ScipyMatrix with kind=adjacency
    # add random edges where they don't currently exist
    return ScipyMatrix(mat_with_new_edges, kind='adjacency')
Here we see the utility of a separate ScipyMatrixType which can be called with kind='adjacency' completely separately from the actual ScipyMatrix class.
When you use the library, you write:
mat = scipy.sparse.csr_matrix((data, (rows, cols)))
g = ScipyMatrix(mat, kind='adjacency') # Have to know about this specific class
g2 = mg.algo.AddRandomEdges(g)
- A .with() method on ConcreteType builds and returns a specialized subclass.
- Calling .with() using the same arguments will return the same object, i.e. it only creates parameterized subclasses when seeing a new set of property values for the first time.
- Unfortunately, with isn't a valid method name. Bummer. Maybe go with .using() or .make() or .set() or .require()?
- Vector(some_dict) can't disambiguate which concrete type to instantiate. Properties would be needed, with a single PythonSparseVector class that takes type=1 or type=2 as a property.
Let's work through all the same examples as above.
nx.DiGraph can fully satisfy the Graph abstract class.
class NxGraph(ConcreteType):
    abstract = Graph
    value_class = nx.DiGraph

    def __init__(self, g: nx.DiGraph):
        self.obj = g
Now we have a concrete algorithm:
def NxAddRandomEdges(g: NxGraph) -> NxGraph:
    # assume g is an instance of NxGraph, must retrieve graph using .obj
    # add random edges where they don't currently exist
    return NxGraph(updated_nx_graph)
Your IDE no longer complains about the signature.
When you use the library, you simply write:
g = nx.DiGraph()
add_starting_edges(g)
g2 = mg.algo.AddRandomEdges(Graph(g))
This involves wrapping nx.DiGraph in a Graph, but the user doesn't need to know about NxGraph. They can use the abstract type and it will figure it out based on registered concrete types.
Now take the case where the raw object doesn't fully satisfy things: scipy.sparse.csr_matrix
class ScipyAdjacencyMatrix(ConcreteType):
    abstract = Graph
    value_class = scipy.sparse.spmatrix

    def __init__(self, mat: scipy.sparse.spmatrix, transposed=False):
        self.obj = mat
        self.transposed = transposed
No need to define two classes. And now the concrete algorithm looks like this:
def ScipyAddRandomEdges(g: ScipyAdjacencyMatrix) -> ScipyAdjacencyMatrix:
    # assume g is of type ScipyAdjacencyMatrix
    # add random edges where they don't currently exist
    return ScipyAdjacencyMatrix(mat_with_new_edges)
Type signature matches, so the IDE is happy
Assume adjacency vs incidence were a property
class ScipyMatrix(ConcreteType):
    abstract = Graph
    allowed_props = {'kind'}
    value_class = scipy.sparse.spmatrix

    def __init__(self, mat: scipy.sparse.spmatrix, kind='adjacency', transposed=False):
        self.obj = mat
        self.kind = kind
        self.transposed = transposed
Now things can get more interesting with algorithms:
def ScipyAddRandomEdges(g: ScipyMatrix.with(kind='adjacency')) -> ScipyMatrix.with(kind='adjacency'):
    # assume g is of type ScipyMatrix with kind=adjacency
    # add random edges where they don't currently exist
    return ScipyMatrix(mat_with_new_edges, kind='adjacency')
The .with(kind='adjacency') call actually creates a new parameterized subclass. Because it is still a subclass of ConcreteType, the resolver can figure out everything it needs about dispatching.
And now for the final trick, let's see how a user might call this.
mat = scipy.sparse.csr_matrix((data, (rows, cols)))
g = Graph(mat, kind='adjacency') # Just use the abstract type and pass properties
g2 = mg.algo.AddRandomEdges(g)
In addition to properties affecting dispatch, they are also required if an abstract type cannot figure out the appropriate concrete type from a raw object. Essentially, we view going from abstract to concrete as a type of dispatch, which properties affect.
As an example, look at how scipy.sparse.spmatrix is used as a Graph. It could be an adjacency matrix or an incidence matrix. Because you can't figure it out solely by inspecting the matrix, there should be a property for the user to indicate that.
Creating two separate concrete types (e.g. ScipyAdjacencyMatrix and ScipyIncidenceMatrix) won't work anymore because they share the same raw object, but neither has any properties to help disambiguate one class from the other. This technically would still be allowed, but only if users begin treating ScipyAdjacencyMatrix as a raw object and wrap a concrete type around it. There may be corner cases where that makes sense, but usually the raw objects will be well-known objects from other libraries that users are already familiar with.
ScipyEdgeSet and ScipyEdgeMap currently have a _node_list attribute that acts as a map between the label of a node and the index of the corresponding row/column in the SciPy matrix.
There's currently no established way of accessing this without getting our hands dirty by using a private attribute.
Perhaps we should have some way of getting a NodeLabel out of a ScipyEdgeSet or ScipyEdgeMap? One idea would be to make it required to have a public "node_label" property or accessor that'll return a NodeLabel.
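For example, a minimal sketch of the accessor idea (the property name node_labels and the constructor shown are illustrative assumptions, not the current classes):

class ScipyEdgeMap:
    def __init__(self, matrix, node_list):
        self.value = matrix
        self._node_list = node_list

    @property
    def node_labels(self):
        """Public accessor for the node-label <-> matrix-index mapping."""
        return self._node_list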
It looks like test_hits_centrality:
metagraph/metagraph/tests/algorithms/test_centrality.py
Lines 230 to 255 in d828260
is now failing. I get:
FAILED metagraph/tests/algorithms/test_centrality.py::test_hits_centrality - AssertionError: Mismatch for node 0: 1.0693502568464412e-135 != -1.375893824035368e-19
This is with NetworkX 2.6.2 on macOS x86-64. The concrete algorithm is here:
metagraph/metagraph/plugins/networkx/algorithms.py
Lines 201 to 206 in 7f316d1
Buried inside metagraph is a generic system that allows for pluggable abstract types, translators, and dispatch (with optional automatic translation). There's nothing really graph specific about this system, so it could be spun out into its own package, with metagraph being condensed to just the abstract types and algorithms for graph analytics.
I think it makes sense to wait on this until metagraph is more mature and we more fully explore the translation and dispatch feature space. Once that core is more mature, a spin out could make sense if there is interest. (Feel free to watch this issue to be notified when this happens.)
I noticed that we copy ConcreteAlgorithms when registered in the resolver:
https://github.com/metagraph-dev/metagraph/blob/main/metagraph/core/resolver.py#L572-L573
This means we could, at time of registration, set an attribute on the ConcreteAlgorithm pointing to the resolver without causing any issue, even if for some reason two resolvers have the same algorithm registered. Having that reference would simplify the logic here:
https://github.com/metagraph-dev/metagraph/blob/main/metagraph/core/plugin.py#L618-L628
As it would no longer require the resolver to be passed as an explicit argument. This is the trick I used in #166, but I would like to generalize it, if it makes sense.
@jim22k: Thoughts?
While beginning to write real concrete algorithms, I came up with a few questions.
What does everyone think about
@abstract_algorithm('link_analysis.pagerank')
def pagerank(...):
pass
@concrete_algorithm(pagerank)
def nx_pagerank(...):
return ...
Currently, the decorator must be @concrete_algorithm('link_analysis.pagerank'), which is nice when the concrete algorithm and abstract are in separate modules. But sometimes they're right next to each other. Both forms are readable. I suggest we support both styles.
Where do default values belong?
@abstract_algorithm('link_analysis.pagerank')
def pagerank(g: Graph, damping: float = 0.85) -> DenseVector:
pass
@concrete_algorithm('link_analysis.pagerank')
def nx_pagerank(g: nx.DiGraph, damping: float = 0.75) -> NumpyVector:
return ... # ^^^^ allow this?
Defining the defaults in the abstract definition seems best so that switching paths doesn't accidentally change the call -- for example, if nx_pagerank had a different damping than cugraph_pagerank. I wouldn't want the damping factor to change just because the solver decided to use the GPU and the user didn't specify what damping factor to use.
Given the resolver solves paths through the landscape of types, translators, and algorithms, it would be nice to visualize that landscape for users. This would show both the number of types as well as completeness of translator pathways.
Brainstorming with @paul-tqh-nguyen, we decided to show:
We also want to have solvers for translation paths and algorithm choice.
Paul will focus on front-end development. The translator page will likely use d3.js. Other pages will be more of filterable tree widgets. The whole thing will likely run as a single-page application with dynamic calls via websockets and JSON data being passed back and forth.
Jim will focus on the back-end development. The structure will likely be a simple websocket design using async calls. It won't be as performant as a real webserver, but it should only have a single client hitting it, so it will suffice. The goal will be to write a temporary webpage and open a browser tab pointing to the temp file. Once the page is closed, the websocket connection will disconnect and the server process will terminate. This will give enough dynamic nature to avoid duplicating logic in javascript without the complexity of a full framework like flask or django.
With abstract properties, we require that increasing the specificity of properties be done via an algorithm. For example, going from undirected=False to undirected=True, there might be an algorithm called treat_edges_as_undirected(Graph(undirected=False)) -> Graph(undirected=True). This allows the user to be explicit when information is dropped.
However, an algorithm drop_weights could work for both a Graph and a Mapping, treating all weights as a value of 1. Currently, there is no way in metagraph to define an algorithm with a polymorphic signature.
A better approach in this case is to have a drop_weights method on the objects, i.e. my_graph.drop_weights() or my_mapping.drop_weights(). This only works for algorithms which take and return the same concrete type.
All concrete types would need to implement the method; otherwise, the method becomes unavailable. We definitely don't want to call my_scipy_adj_matrix.drop_weights() and have metagraph convert to networkx, drop the weights, and convert back to a scipy adjacency matrix. I can't imagine a case where that isn't surprising to a user.
The actual implementation could be as an algorithm in the normal sense with the method being a special handling by metagraph. Or, we could have users write it as an actual method on their concrete type. I'm more inclined to go with the latter.
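A minimal sketch of the "actual method on the concrete type" option, using an illustrative SciPy-backed graph wrapper (the class and attribute names are assumptions):

import scipy.sparse as ss

class ScipyGraph:
    def __init__(self, matrix):
        self.value = matrix

    def drop_weights(self):
        """Return the same concrete type with every edge weight treated as 1."""
        unweighted = self.value.copy()
        unweighted.data[:] = 1
        return ScipyGraph(unweighted)

g = ScipyGraph(ss.csr_matrix([[0, 2.5], [0.5, 0]]))
assert (g.drop_weights().value.data == 1).all()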
@jim22k, I sometimes get this failure when running tests on master:
_________________________________________________________ test_python _________________________________________________________
default_plugin_resolver = <metagraph.core.resolver.Resolver object at 0xa1c9e6550>
def test_python(default_plugin_resolver):
dpr = default_plugin_resolver
sparse_dict = {1: 1.1, 5: 5.5, 2: 2.2}
x = PythonSparseVector(sparse_dict, size=30)
assert len(x) == 30
# Convert back and forth from numpy sparse array
nps = dpr.translate(x, NumpySparseVector)
assert isinstance(nps, NumpySparseVector)
> y = dpr.translate(nps, PythonSparseVector)
metagraph/tests/translators/test_sparsevector.py:15:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
metagraph/core/resolver.py:338: in translate
translator = self.find_translator(value, dst_type)
metagraph/core/resolver.py:325: in find_translator
ret_val = self.find_translation_path(src_type, dst_type)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <metagraph.core.resolver.Resolver object at 0xa1c9e6550>
src_type = <class 'metagraph.default_plugins.wrappers.numpy.NumpyVectorType'>
dst_type = <class 'metagraph.default_plugins.wrappers.python.PythonSparseVectorType'>
def find_translation_path(self, src_type, dst_type) -> Optional[Translator]:
import scipy.sparse as ss
abstract = dst_type.abstract
if abstract not in self.translation_matrices:
# Build translation matrix
concrete_types = [
ct for ct in self.concrete_types if issubclass(ct.abstract, abstract)
]
m = ss.dok_matrix((len(concrete_types), len(concrete_types)), dtype=bool)
for s, d in self.translators:
try:
sidx = concrete_types.index(s)
didx = concrete_types.index(d)
m[sidx, didx] = True
except ValueError:
pass
sssp, predecessors = ss.csgraph.dijkstra(
m.tocsr(), return_predecessors=True, unweighted=True
)
self.translation_matrices[abstract] = (concrete_types, sssp, predecessors)
# Lookup shortest path from stored results
concrete_types, sssp, predecessors = self.translation_matrices[abstract]
> sidx = concrete_types.index(src_type)
E ValueError: <class 'metagraph.default_plugins.wrappers.numpy.NumpyVectorType'> is not in list
Something changed in importlib_metadata=3.7.3 (previously 2.0.0) which breaks test_load_duplicate_name. Looking at sys.path, both site_dir and bad_site_dir are present. But when grabbing entry points, only bad_site_dir is represented (along with core metagraph).
Rather than digging in: python=3.8 introduced importlib.metadata, and we should use that instead. Until we stop supporting python=3.7 in metagraph, let's just pin to importlib_metadata=2.0.0 to avoid the bug.
Since Placeholder objects represent their final value with a single key (like Delayed), it makes sense for them to have a key property, similar to Delayed. See:
https://github.com/dask/dask/blob/master/dask/delayed.py#L525-L527
This will make writing tests more convenient.
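A minimal sketch of the idea, mirroring how Delayed exposes its key (the internal attribute name here is an assumption about Placeholder):

class Placeholder:
    def __init__(self, key, dsk):
        self._key = key
        self._dsk = dsk

    @property
    def key(self):
        """The single task-graph key whose result this placeholder represents."""
        return self._key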
MultiVerify is sufficiently long and complex that it could use its own unit tests:
https://github.com/ContinuumIO/metagraph/blob/master/metagraph/tests/algorithms/__init__.py#L11-L166
Our current usage of MultiVerify only hits 58% of the lines, so many corner cases of the utility are not even exercised.
Properties still don't feel fully fleshed out in the codebase. Here are my thoughts on the direction we should go regarding properties.
Some properties affect algorithms computationally, not just in how the data is stored and accessed. For example, triangle counting only works for undirected graphs.
Such properties should be defined on the abstract type, not on concrete types. This would allow abstract algorithms to indicate required properties which all implementations must satisfy.
Axioms of properties on abstract types
In the tests, we have a StrType with a lowercase property. A string with lowercase=True is still valid if the property is changed to lowercase=False because it is already in a narrower, more constrained state. Going from the more general lowercase=False to True requires changing the data.
This pattern shows up in our real types:
A property like is_transposed on an AdjacencyMatrix is easily reversible. Dataframe labels are also reversible. If the current src_label='SOURCE', it's easy to request src_label='SRC' and then reverse the process back to 'SOURCE'.
These properties do not affect the computation. They will affect the specific algorithm, but changing the representation of the data never loses information.
Any property which doesn't fit the requirements of an abstract property must be a concrete property, defined separately for each concrete type.
Axioms of properties on concrete types
We already have the concept of subclassed abstract types, like WeightedGraph vs Graph. This was done because we recognized that weight can go one way, but is not reversible. Abstract properties generalize this finding, so we may not need abstract type subclassing.
With abstract properties, I think we only need a few abstract types now (listed along with their properties):
Translators can easily deal with concrete properties because they are reversible. Self-translators only deal with concrete properties.
Abstract properties are more nuanced when it comes to translators. In general, translators should avoid changing abstract properties when converting from one concrete type to another. If that is not possible because the destination type does not support the source's level of specificity, the translator should generalize that abstract property to the maximum allowed by the destination. Doing so will not lose any information. However, the reverse should never be done. A translator may not increase the specificity of an abstract property. Such an action requires an algorithm or possibly a constructor.
Rules for translators
There are some unused parameters in the compiler APIs to pass literal values (i.e. values that should be frozen at compile time and can be used to influence code generation) to the compiler. These eventually need to be wired up after #173, which will make the distinction between compile-time literal values (which won't be wrapped in a placeholder class) and runtime values (which will be wrapped in placeholder classes) more clear.
(Eventually we will need some additional logic to allow things like a trained graph embedding to be treated as a literal so it can be fused with another ML algorithm.)
Concrete types without a value_type, like NumpyMatrixType and NumpyVectorType, are breaking the explorer code.
metagraph/explorer/api.py:170
AttributeError: 'NoneType' object has no attribute '__module__'
MultiVerify (defined in the testing framework for algorithms), has two methods:
The first handles translation from different return types and does an exact comparison. The challenge is that many algorithms are not deterministic, which is a requirement for exact comparison of outputs.
The second is fully flexible, but puts the burden on the user to compare everything.
With the addition of utility functions, many nondeterministic results can be made deterministic. For example, in max flow we know which nodes are the bottlenecks, and by doing an edge aggregation followed by a node selection, we get a deterministic result to compare.
I propose adding enhancements to MultiVerify to allow calls something like this:
mv = MultiVerify(resolver)
results = mv.compute("flow.max_flow", graph) # returns (flow_rate, graph_with_actual_flow_on_edges)
# Compare flow rate
results[0].assert_equal(6)
# Normalize actual flow to prepare to transform
actual_flow = results[1].normalize("NetworkXGraphType")
# Compare sum of out edges for bottleneck nodes
out_edges = mv.transform(core_networkx, "util.graph.aggregate_edges", actual_flow, lambda x, y: x + y, initial_value=0)
out_bottleneck = mv.transform(core_networkx, "util.nodemap.select", out_edges, PythonNodeSet({2, 4}))
out_bottleneck.assert_equal(PythonNodeMap({2: 6, 4: 6}))
# Compare sum of in edges for bottleneck nodes
in_edges = mv.transform(core_networkx, "util.graph.aggregate_edges", actual_flow, lambda x, y: x + y, initial_value=1, in_edges=True, out_edges=False)
in_bottleneck = mv.transform(core_networkx, "util.nodemap.select", in_edges, PythonNodeSet({2, 4}))
in_bottleneck.assert_equal(PythonNodeMap({2: 6, 4: 6}))
@eriknw found this bug
Six tests are failing after upgrade to scipy==1.6.1. Some translation is converting ints to np.float64 rather than np.int64.
If we define some algorithms like this:
@abstract_algorithm('testing.scale')
def testing_scale(a: Vector, scale: float)->Vector: #pragma: no cover
pass
@concrete_algorithm('testing.scale', compiler="identity")
def compiled_scale(a: NumpyVectorType, scale: float)->NumpyVectorType:
return a * scale
And then construct a value like this:
a = np.arange(100)
scale_func = res.algos.testing.scale
z = scale_func(scale_func(scale_func(a, 2.0), 3.0), 4.0)
The resulting dask graph is this:
{('call-8b12ab9c7500e5a04d2200176747e90e', 'testing.scale'): (<metagraph.core.plugin.ConcreteAlgorithm object at 0x7ff6080720d0>,
('call-8f1882ff20c40a1140f666ab9450cf58',
'testing.scale'),
3.0),
('call-8f1882ff20c40a1140f666ab9450cf58', 'testing.scale'): (<metagraph.core.plugin.ConcreteAlgorithm object at 0x7ff6080720d0>,
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]),
2.0),
('call-a0001325df28ad22d68449a7ebc15273', 'testing.scale'): (<metagraph.core.plugin.ConcreteAlgorithm object at 0x7ff6080720d0>,
('call-8b12ab9c7500e5a04d2200176747e90e',
'testing.scale'),
4.0)}
In particular, notice this task:
('call-8f1882ff20c40a1140f666ab9450cf58', 'testing.scale'): (<metagraph.core.plugin.ConcreteAlgorithm object at 0x7ff6080720d0>,
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]),
2.0),
has included the NumPy array (NumpyVectorType) in the arguments for the task.
For various reasons, I think we want to actually wrap that array in a placeholder object as a separate task in the graph and then pass that key to the concrete algorithm. In particular, we will eventually want the ability to rewrite the graph to swap out one concrete algo with another, which might necessitate a translation step. It is cleaner if the translatable data (i.e., those values represented by concrete types) have been separated out into their own entries in the task graph. Also, by wrapping the value in a placeholder, we can retain the type metadata about the value, which will also help in some graph rewrite transformations.
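To illustrate the desired shape of the graph, here is a hypothetical sketch (the wrapper function, type tagging, and keys are all illustrative, not the real metagraph API) where the raw array becomes its own keyed task and downstream calls refer to it by key:

import numpy as np
import dask

def wrap_value(value, type_name):
    # stand-in for creating a placeholder that records the concrete type metadata
    return {"value": value, "type": type_name}

def scale(wrapped, factor):
    return wrap_value(wrapped["value"] * factor, wrapped["type"])

a = np.arange(100)
dsk = {
    "wrap-0": (wrap_value, a, "NumpyVectorType"),
    ("call-1", "testing.scale"): (scale, "wrap-0", 2.0),
    ("call-2", "testing.scale"): (scale, ("call-1", "testing.scale"), 3.0),
}
result = dask.get(dsk, ("call-2", "testing.scale"))  # value is a * 6.0, type metadata preserved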
This should not be allowed:
from metagraph import abstract_algorithm, PluginRegistry
from metagraph.plugins.core.types import Vector
from metagraph.plugins.numpy.types import NumpyVectorType
@abstract_algorithm('testing.scale')
def testing_scale(a: Vector, scale: float)->NumpyVectorType: # oops, bad return type
pass
registry = PluginRegistry('test_registry')
registry.register(testing_scale) # this should raise an error
Some type conversions and algorithms are obvious:
Some of these are already proposed as required wrapper methods.
Other useful algorithms require a callable:
These all require a callable of some sort to be passed. It might be as simple as max, min, or sum, but it could also be a user-defined function.
Aggregation (or reduction) methods generally take a binary function: lambda x, y: x + y. There are some cases where all values are needed simultaneously -- for example, to calculate the median.
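A small, library-agnostic illustration of that distinction (plain Python/NumPy, nothing metagraph-specific):
import functools
import numpy as np

values = [3, 1, 4, 1, 5]

# A binary aggregator can be folded over the values two at a time.
total = functools.reduce(lambda x, y: x + y, values)  # 14

# The median cannot be expressed as a binary fold; it needs every value at once.
median = np.median(values)  # 3.0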
Initial Proposal:
Filters are more complicated than aggregators. Filters technically require a unary operator, but in practice are often binary operators with a scalar (>0, ==max_val, etc.). Another common filter practice is to accept a boolean array as input.
Initial Proposal:
- compare method between two objects; accept common binary functions or user-defined binary functions. Could return a NodeSet instead of a boolean NodeMap.
- select method which takes a NodeSet as input and only keeps those nodes.

Some of these changes require metagraph to modify inputs prior to dispatching. Jitting functions is an example of this.
Having a layer which allows the User API to differ from the concrete implementation API should be used with caution, but seems necessary and useful in several algorithms.
Currently the translation matrix cache in Resolver is a dictionary of tuples:
https://github.com/ContinuumIO/metagraph/blob/master/metagraph/core/planning.py#L93-L94
This could easily be made into a dataclass, possibly with the construction code as a method:
https://github.com/ContinuumIO/metagraph/blob/master/metagraph/core/planning.py#L93-L94
This would also make it possible to write unit tests just to test the translation matrix.
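A minimal sketch of what that dataclass might look like (the field and method names here are assumptions for illustration, not what planning.py actually stores):
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

@dataclass
class TranslationMatrix:
    """Hypothetical container for the translation table of one abstract type."""
    concrete_types: list = field(default_factory=list)
    # (source concrete type, destination concrete type) -> translator callable
    translators: Dict[Tuple[type, type], Callable] = field(default_factory=dict)

    @classmethod
    def build(cls, translators):
        """Construction logic that currently lives in the Resolver could move here."""
        matrix = cls()
        for (src, dst), func in translators.items():
            matrix.translators[(src, dst)] = func
            for ct in (src, dst):
                if ct not in matrix.concrete_types:
                    matrix.concrete_types.append(ct)
        return matrix
With the construction code on the class, a unit test can build a TranslationMatrix from a handful of fake translators and assert on its contents without standing up a full Resolver.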
The default visualize() method in Dask uses circles for function nodes, which creates several problems with our current visualization of Metagraph DAGs (described below). We could fix this with our own visualize() method, and we could also add in some special features along the way.
The problem with circles for function nodes is that it severely limits the length of the label unless you want function nodes which take up a disproportionately large area in the drawing of the DAG. To work around this limitation, we've forced our longer labels into the rectangular data nodes, which results in a confusing diagram. Data nodes contain labels describing translations (like NumpyNodeMapType->PythonNodeMapType) or algorithm names (like util.nodemap.apply), when those labels should be on the function nodes that precede the data node.
Ideally, data nodes should be labeled with the concrete data type whenever possible. Additionally, we can add some special kwargs to our custom visualize() to do things like:
We can do this customization by constructing and passing function_attributes and data_attributes dictionaries to the underlying dask visualize() method, overriding the default node styles, colors, and labels.
We will also need to provide this visualize() method as a standalone function for use in larger DAGs where the final node is not a Metagraph placeholder object, but something else that will have the standard visualize() method.
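A sketch of the plumbing (the label choices are placeholders; the real logic would pull the algorithm name or translator description from the task and the concrete type from the placeholder metadata):
from dask.dot import dot_graph

def visualize(dsk, filename="metagraph", **kwargs):
    """Hypothetical custom visualize() for a metagraph task graph (a dict)."""
    function_attributes = {}
    data_attributes = {}
    for key in dsk:
        # Keys in the graphs above look like ('call-<hash>', '<algorithm name>');
        # put the algorithm/translator label on a rectangular function node...
        if isinstance(key, tuple) and len(key) == 2:
            function_attributes[key] = {"shape": "box", "label": str(key[1])}
        # ...and label the data node separately (ideally with the concrete type).
        data_attributes[key] = {"label": "(data)"}
    return dot_graph(dsk, filename=filename,
                     function_attributes=function_attributes,
                     data_attributes=data_attributes, **kwargs)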
In discussion of #28, we've encountered a problem that seems to be fundamentally about tracking metadata on objects defined outside of metagraph. Many of the properties that we may want to dispatch on are both:
Recomputing some properties every time they are needed is a non-starter, and attempting to cache properties on the object itself would likely surprise the user by mutating their objects (even if in invisible ways), potentially create strange bugs, or be impossible if the object forbids adding attributes.
Workarounds include:
In the case of Graph, I am leaning more toward Metagraph needing to provide a wrapper, if only to try to create some uniformity between different Graph objects. More so than vectors and matrices, graphs can be defined in dramatically different ways between libraries, and some amount of manual intervention from the user is needed to describe how to interpret the graph.
However, for other objects, and for general usage, another approach is to have a property cache inside the Resolver object (which is the place where all of these properties are likely to be needed). This cache can track properties of externally-defined objects as they pass through the system, avoiding the need to do expensive recomputation or to mutate the object.
The main problem to solve with an object property cache is "what to use as a hash?". Not all objects are hashable, and some may be somewhat expensive to hash. The cheapest "hash" function to use is id(obj), which in CPython returns the memory address of the PyObject struct. The danger of persisting id() in a data structure, like a cache, is that the id() can be reused during a program if the object is deleted and a new object is created which happens to get the same location in memory. In order to avoid this problem, we can use a weakref to track if the object has been deleted since we last checked the property.
Here's a quick proof of concept of what this could look like:
import weakref

class Dummy:
    pass

class PropertyCache:
    def __init__(self):
        self._prop_cache = {}
        self._ref_check = weakref.WeakValueDictionary()

    def __getitem__(self, obj):
        objid = id(obj)
        if objid in self._prop_cache and \
           objid in self._ref_check:  # verify weakref still valid
            return self._prop_cache[objid]
        else:
            props = {}
            self._prop_cache[objid] = props
            self._ref_check[objid] = obj  # weak reference
            return props
prop_cache = PropertyCache()
obj = Dummy()
first_id = id(obj)
prop_cache[obj]['flavor'] = 'vanilla'
assert 'flavor' in prop_cache[obj]
assert first_id in prop_cache._ref_check # object should be present
del obj
assert first_id not in prop_cache._ref_check # object should be removed
A better implementation of this property cache should use the callback mechanism of weakrefs to also remove entries in self._prop_cache when objects are deleted, and should also make __getitem__() thread-safe.
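A sketch of that improved version (still a proof of concept; the locking is deliberately coarse):
import threading
import weakref

class PropertyCache:
    def __init__(self):
        self._prop_cache = {}
        self._refs = {}
        self._lock = threading.RLock()

    def __getitem__(self, obj):
        objid = id(obj)
        with self._lock:
            if objid in self._prop_cache:
                return self._prop_cache[objid]
            props = {}
            self._prop_cache[objid] = props
            # The callback removes both entries when the object is garbage collected,
            # so a recycled id() can never hand back stale properties.
            self._refs[objid] = weakref.ref(obj, lambda _, objid=objid: self._evict(objid))
            return props

    def _evict(self, objid):
        with self._lock:
            self._prop_cache.pop(objid, None)
            self._refs.pop(objid, None)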
I think by both having a Graph wrapper and using the above property cache to track additional attributes of third-party objects, we can resolve the concerns with properties in metagraph.
When writing tests, it is common to use this pattern to create a registry with some plugins for testing:
@abstract_algorithm('testing.scale')
def testing_scale(a: Vector, scale: float) -> Vector:  # pragma: no cover
    pass

@concrete_algorithm('testing.scale', compiler="identity")
def compiled_scale(a: NumpyVectorType, scale: float) -> NumpyVectorType:
    return a * scale
registry = PluginRegistry("test_subgraphs_plugin")
registry.register(testing_scale)
registry.register(compiled_scale)
I find myself often forgetting to actually register the plugins at the end. While there are good reasons for this format in actual plugins (which will likely use registry.register_from_modules()), I wish I had an alternate syntax like:
with PluginRegistry("test_subgraphs_plugin") as registry:
    @abstract_algorithm('testing.scale')
    def testing_scale(a: Vector, scale: float) -> Vector:  # pragma: no cover
        pass

    @concrete_algorithm('testing.scale', compiler="identity")
    def compiled_scale(a: NumpyVectorType, scale: float) -> NumpyVectorType:
        return a * scale
that would auto-register all the plugin-like things defined within the context manager block. Unfortunately, I think implementing this requires too much magic to behave as expected, like snapshotting locals() before and after the block to discover everything defined within it (which fails if a variable is redefined).
Since this is only for testing, perhaps a variant of register_from_modules(), like a register_from_locals() which discovers everything in the local namespace of the fixture function (and is not recursive), would be sufficient.
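Roughly (register_from_locals() does not exist yet; AbstractAlgorithm is assumed to live next to ConcreteAlgorithm, which appears in the task graphs above):
import inspect

from metagraph.core.plugin import AbstractAlgorithm, ConcreteAlgorithm  # assumed location

def register_from_locals(registry):
    """Hypothetical test helper: register every plugin-like object defined in the
    caller's local namespace. Non-recursive by design."""
    caller_locals = inspect.currentframe().f_back.f_locals
    for obj in caller_locals.values():
        if isinstance(obj, (AbstractAlgorithm, ConcreteAlgorithm)):
            registry.register(obj)
A test fixture would then define its algorithms as above and finish with a single register_from_locals(registry) call instead of one register() call per object.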
The data shown by display() on these two classes is useful. Should we make that the output of __str__() for easier access? display() can then just be print(str(self)).
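If we do, the refactor on each class is small; something like (the class name here is a stand-in):
class ExampleWrapper:  # stand-in for the two classes in question
    def __str__(self):
        # the formatting that display() currently builds would move here
        return "...formatted summary..."

    def display(self):
        # display() then becomes the thin wrapper suggested above
        print(str(self))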
Some embedding algorithms take a list of Graphs as input. When running under Dask, these remain Placeholders when handed to the plugin code.