Coder Social home page Coder Social logo

benedekrozemberczki / karateclub Goto Github PK

View Code? Open in Web Editor NEW
2.1K 37.0 239.0 28.71 MB

Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

Home Page: https://karateclub.readthedocs.io

License: GNU General Public License v3.0

Python 100.00%
community-detection graph-clustering deepwalk networkx louvain network-science machine-learning unsupervised-learning gcn node2vec

karateclub's Introduction

Version License repo size Arxiv build badge coverage badge


Karate Club is an unsupervised machine learning extension library for NetworkX.

Please look at the Documentation, relevant Paper, Promo Video, and External Resources.

Karate Club consists of state-of-the-art methods to do unsupervised learning on graph structured data. To put it simply it is a Swiss Army knife for small-scale graph mining research. First, it provides network embedding techniques at the node and graph level. Second, it includes a variety of overlapping and non-overlapping community detection methods. Implemented methods cover a wide range of network science (NetSci, Complenet), data mining (ICDM, CIKM, KDD), artificial intelligence (AAAI, IJCAI) and machine learning (NeurIPS, ICML, ICLR) conferences, workshops, and pieces from prominent journals.

The newly introduced graph classification datasets are available at SNAP, TUD Graph Kernel Datasets, and GraphLearning.io.


Citing

If you find Karate Club and the new datasets useful in your research, please consider citing the following paper:

@inproceedings{karateclub,
       title = {{Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs}},
       author = {Benedek Rozemberczki and Oliver Kiss and Rik Sarkar},
       year = {2020},
       pages = {3125–3132},
       booktitle = {Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20)},
       organization = {ACM},
}

A simple example

Karate Club makes the use of modern community detection techniques quite easy (see here for the accompanying tutorial). For example, this is all it takes to use on a Watts-Strogatz graph Ego-splitting:

import networkx as nx
from karateclub import EgoNetSplitter

g = nx.newman_watts_strogatz_graph(1000, 20, 0.05)

splitter = EgoNetSplitter(1.0)

splitter.fit(g)

print(splitter.get_memberships())

Models included

In detail, the following community detection and embedding methods were implemented.

Overlapping Community Detection

Non-Overlapping Community Detection

Proximity Preserving Node Embedding

Structural Node Level Embedding

Attributed Node Level Embedding

Meta Node Embedding

Graph Level Embedding

Head over to our documentation to find out more about installation and data handling, a full list of implemented methods, and datasets. For a quick start, check out our examples.

If you notice anything unexpected, please open an issue and let us know. If you are missing a specific method, feel free to open a feature request. We are motivated to constantly make Karate Club even better.


Installation

Karate Club can be installed with the following pip command.

$ pip install karateclub

As we create new releases frequently, upgrading the package casually might be beneficial.

$ pip install karateclub --upgrade

Running examples

As part of the documentation we provide a number of use cases to show how the clusterings and embeddings can be utilized for downstream learning. These can accessed here with detailed line-by-line explanations.

Besides the case studies we provide synthetic examples for each model. These can be tried out by running the example scripts. In order to run one of the examples, the Graph2Vec snippet:

$ cd examples/whole_graph_embedding/
$ python graph2vec_example.py

Running tests

From the project's root-level directory:

$ pytest

License

karateclub's People

Contributors

benedekrozemberczki avatar bentenmann avatar chihming avatar cyprienc avatar jamesvrt avatar kiss-oliver avatar kwan1997 avatar lucacappelletti94 avatar mef avatar nicolasdugue avatar rjurney avatar synapticarbors avatar timgates42 avatar tnto avatar tomlincr avatar whatthefuzz avatar yuvalgrossman avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

karateclub's Issues

Is it possible to save embeddings for later use?

Hi

Is there any way to save the embedding for later use as a file, e.g., result.emb?

The output file has n+1 lines for a graph with n vertices. The first line has the following format:
num_of_nodes dim_of_representation

The next n lines are as follows:
node_id dim1 dim2 ... dimd

where dim1, ... , dimd is the d-dimensional representation learned by the embedding method

Ex:
34 2
33 1.4270996 -1.6193795
0 -1.2400748 -1.3776317
32 1.6967283 -1.7419897
.
.
.
11 -1.193218 -1.3673347

Thank you

Having some difficulty with embedding graphs and nodes

Good day,

I have been trying to create embeddings for nodes in a graph and for whole graphs using the FeatherNode and FeatherGraph functions.

What am I doing wrong? is it because not all the nodes in the graphs are connected? How can I get around this?

Regards
Chris

AssertionError Traceback (most recent call last)
in
1 model = FeatherNode()
----> 2 model.fit(G, expression.values)
3 emb = model.get_embedding()

~\anaconda3\envs\MIT_807_P38\lib\site-packages\karateclub\node_embedding\attributed\feathernode.py in fit(self, graph, X)
109 """
110 self._set_seed()
--> 111 self._check_graph(graph)
112 X = self._create_reduced_features(X)
113 A_tilde = self._create_A_tilde(graph)

~\anaconda3\envs\MIT_807_P38\lib\site-packages\karateclub\estimator.py in _check_graph(self, graph)
60 def _check_graph(self, graph: nx.classes.graph.Graph):
61 """Check the Karate Club assumptions about the graph."""
---> 62 self._check_connectivity(graph)
63 self._check_directedness(graph)
64 self._check_indexing(graph)

~\anaconda3\envs\MIT_807_P38\lib\site-packages\karateclub\estimator.py in _check_connectivity(self, graph)
42 """Checking the connected nature of a single graph."""
43 connected = nx.is_connected(graph)
---> 44 assert connected, "Graph is not connected."
45
46

AssertionError: Graph is not connected.

Graph Embedding with Node and Edge Features

hi,
it is an awesome tool. I have some graphs. Each node and and edge has a vector feature.
I do not sure how to use kateclub with these graphs. What is the input format of graphs?

Graph not connected false error

When using GraphWave method with the same graph works on version 0.45.10 of karateclub but using the last versions shows an error of 'Graph not connected'. In that previous version the resulting embeddings were fine and worked well in tasks such as regression with Tensorflow. So it might be an error with the new version regarding the validation of the graph.

Thanks for this awesome python library!

Connected graph constraint

It is slightly odd that many algorithms you provide demand that graphs must be connected. Could you at least provide a helper function that converts a graph to a connected one, by adding edges with weight of 0?

Missing code snippets in readthedocs notes/introduction

Under the Introduction by example section, in the API driven design paragraph, it appears that there are two code snippets missing.

The first snippet ought to follow the sentence "This API driven design means that one can create a DeepWalk embedding of a Watts-Strogatz graph just like this."

The second one ought to follow the sentence "This can be modified to create a Walklets embedding with minimal effort like this."

(Also, changing the full point to a colon would read better.)

Graph classification in batches

Hi, thanks for open sourcing this great library!

I was wondering if you have any plans to add batch support for large graph datasets that don't fit into memory? For example, FeatherGraph (along with others) could load graphs from a list of files in batches instead of all at once.

Question about overlapping community detecting algorithm.

Why the returned membership of each node is a single number, the belonging cluster is expected to an array in the sense of overlapping(each node may belong to multiple clusters). What's the difference between overlapping and non-overlapping detecting algorithms?

About GL2vec

Hello,
thanks for the awesome work!!

It seems that there are 2 mistakes in the implementation of GL2vec module.

The first one is :

in the code below,
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
def _create_line_graph(self, graph):
r"""Getting the embedding of graphs.
Arg types:
* graph (NetworkX graph) - The graph transformed to be a line graph.
Return types:
* line_graph (NetworkX graph) - The line graph of the source graph.
"""
graph = nx.line_graph(graph)
node_mapper = {node: i for i, node in enumerate(graph.nodes())}
edges = [[node_mapper[edge[0]], node_mapper[edge[1]]] for edge in graph.edges()]
line_graph = nx.from_edgelist(edges)
return line_graph
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
when converting graph G to line graph LG, the method "_create_line_graph()" ignores the edge attribute of G. (It means that there will be no node arrtibutes in LG)
so consequently, the method "WeisfeilerLehmanHashing" will not use the attribute information, and will always use the structural information (degree) instead.

The second one is :

The GL2vec module only returns the embedding of line graph.
But in the original paper of GL2vec, they concatenate the embedding of graph and of line graph.
Then named the framework "GL2vec", which means "Graph and Line graph to vector".

Only use the embedding of line graph for downstream task may lead to worse performance.

We noticed that when applying the embeddings to the graph classification task, (when graph both have node attribute and edge attribute) the performance (accuracy) are as follow:
concat(G , LG) > G > LG

Hope it helps :)

Does karateclub support an undirected weight graph?

Hello:
I'm lucky to find this great project.
I‘m solving a problem modeled as an undirected weight graph, and I read the source code of 'EgoNetSplitter', 'MNMF', 'GEMSEC', but there is no 'edge weight'.
Does karateclub support an undirected weight graph?

Graph2Vec: ValueError: Graph is not connected. Please see requirements.

Hi there,

I tried to run Graph2Vec and ensured the following is true:

  • The graph is undirected.
  • Nodes are indexed with integers.
  • There are no orphaned nodes in the graph.
  • The node indexing starts with zero and the indices are consecutive.

Yet, still get a ValueError: Graph is not connected. Please see requirements. Any insights as to what may be the problem.

I removed orphaned nodes and relabeled the indices.

G.remove_nodes_from(list(nx.isolates(G)))
mapping = {id: i for i, id in enumerate(G.nodes())}
H = nx.relabel_nodes(G, mapping)

from karateclub import Graph2Vec

model = Graph2Vec()
model.fit([H])
X = model.get_embedding()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\...\lib\site-packages\karateclub\estimator.py in _check_connectivity(self, graph)
     45             if not connected:
---> 46                 raise ValueError("Graph is not connected. Please see requirements.")
     47         except:

ValueError: Graph is not connected. Please see requirements.

Enhancement with Node2Vec and GraphSAGE

Hi, nice package! Just wanted to know if there is any plan to support additional papers like Node2Vec and GraphSage. Especially Node2Vec, as it's an extension (generalization) of existing DeepWalk. Thnx!

Size of BoostNE embeddings

Hi all,

thanks for this great piece of software! I just started using your BoostNE implementations and was wondering why the dimensions of the output embeddings do not follow the dimensions parameter. The output instead has the size of d=dimensions+dimensions*iterations.

Is this a bug or a feature?

Best
Malte

Role2Vec, Diff2Vec, and NodeSketch are throwing on directed graph

Role2Vec, Diff2Vec, and NodeSketch are throwing errors for me on directed graphs. But, LaplacianEigenmaps gives a message alerting that it's not implemented for directed graphs. Is that the case too for the first three, or are these bugs?

When I convert to undirected all three of these work.

Here are the relevant stack traces,

Role2Vec:

~/.local/lib/python3.7/site-packages/karateclub/node_embedding/structural/role2vec.py in fit(self, graph)
     90 
     91         node_features = hasher.get_node_features()
---> 92         documents = self._create_documents(walker.walks, node_features)
     93 
     94         model = Doc2Vec(documents,

~/.local/lib/python3.7/site-packages/karateclub/node_embedding/structural/role2vec.py in _create_documents(self, walks, features)
     69                     source = walk[i]
     70                     target = walk[i+j]
---> 71                     new_features[source].append(features[target])
     72                     new_features[target].append(features[source])
     73 

KeyError: 0

Diff2Vec:

~/.local/lib/python3.7/site-packages/karateclub/node_embedding/neighbourhood/diff2vec.py in fit(self, graph)
     42         """
     43         diffuser = EulerianDiffuser(self.diffusion_number, self.diffusion_cover)
---> 44         diffuser.do_diffusions(graph)
     45 
     46         model = Word2Vec(diffuser.diffusions,

~/.local/lib/python3.7/site-packages/karateclub/utils/diffuser.py in do_diffusions(self, graph)
     55         for _ in range(self.diffusion_number):
     56             for node in self.graph.nodes():
---> 57                 diffusion_sequence = self._run_diffusion_process(node)
     58                 self.diffusions.append(diffusion_sequence)

~/.local/lib/python3.7/site-packages/karateclub/utils/diffuser.py in _run_diffusion_process(self, node)
     32             end_point = random.sample(infected, 1)[0]
     33             nebs = [node for node in self.graph.neighbors(end_point)]
---> 34             sample = random.choice(nebs)
     35             if sample not in infected:
     36                 infected_counter = infected_counter + 1

/usr/lib/python3.7/random.py in choice(self, seq)
    259             i = self._randbelow(len(seq))
    260         except ValueError:
--> 261             raise IndexError('Cannot choose from an empty sequence') from None
    262         return seq[i]
    263 

IndexError: Cannot choose from an empty sequence

NodeSketch:

~/.local/lib/python3.7/site-packages/karateclub/node_embedding/neighbourhood/nodesketch.py in fit(self, graph)
     91         self._do_single_sketch()
     92         for _ in range(self.iterations-1):
---> 93             self._augment_sla()
     94             self._do_single_sketch()
     95 

~/.local/lib/python3.7/site-packages/karateclub/node_embedding/neighbourhood/nodesketch.py in _augment_sla(self)
     58         for node in range(self._num_nodes):
     59             frequencies = []
---> 60             for neighbor in list(self._graph[node]):
     61                 frequencies.append(Counter([dim[neighbor] for dim in self._sketch]))
     62             frequencies = sum(frequencies, Counter())

~/.local/lib/python3.7/site-packages/networkx/classes/graph.py in __getitem__(self, n)
    456         AtlasView({1: {}})
    457         """
--> 458         return self.adj[n]
    459 
    460     def add_node(self, node_for_adding, **attr):

~/.local/lib/python3.7/site-packages/networkx/classes/coreviews.py in __getitem__(self, name)
     79 
     80     def __getitem__(self, name):
---> 81         return AtlasView(self._atlas[name])
     82 
     83     def copy(self):

KeyError: 0

Node feature

I checked the sample dataset and source code. It seems that the node feature is a value. Is it possible to use vector feature?

Graph embedding with direction and weight

Hi, thanks for your great work.

I'm looking for a way to make whole graph embeddings with directed edges and multi dimension weights for nodes.
Do you know which API can do it?
I have tried all of graph embedding API in this, but could not see find a good one.
TIA.

How to set the maximum node number in each community?

Hi,

Thanks for your good work!

I want use community detection methods, such as GEMSEC, to separate the whole graph into
several non-overlapping subgraphs. At the same time, I want the maximum node number in
each subgraph is under a certain value, for example 100. How can I make this implement possible?

Looking forward for your attention and help.

Problem with ED-mot

I don't think Ed-mot is properly implemented in this package, in the paper, the connected components are not just joined into a giant clique as is performed in the implementation of this. In the paper the connected components are partitioned into so called modules by the Louvain community detection algorithm and and then those nodes in the modules are connected to form a clique, this is a subtle point but important for the algorithm. This is not the Ed-mot algorithm as described in the paper.

Numerical instability of NNSED

When I implemented NNSED on my own network, I encountered the following problem:

C:\Users\Administrator\anaconda3\lib\site-packages\karateclub\community_detection\overlapping\nnsed.py:75: RuntimeWarning: invalid value encountered in true_divide
self._W = self._W*(enum/denom)

and this is caused by some zero entries in denom.
After adding the following codes, my issues are resolved successfully, and NNSED can now converge.

def _update_W(self, A):
        enum = A.dot(self._Z.T)
        denom_1 = self._W.dot(self._Z).dot(self._Z.T)
        denom_2 = (A.dot(A.transpose())).dot(self._W)
        denom = denom_1 + denom_2 + **1e-6**
        self._W = self._W*(enum/denom)

def _update_Z(self, A):
        enum = (A.dot(self._W)).transpose()
        denom = np.dot(np.dot(self._W.T, self._W), self._Z) + self._Z + **1e-6**
        self._Z = self._Z*(enum/denom)

I hope this could be fixed in the next version.

Community Detection on My Own Graph

Hi,

I try to conduct community detection on my own graph with GEMSEC example but an error raises:

File "/home/bright/.local/lib/python3.6/site-packages/karateclub/estimator.py", line 57, in _check_indexing
assert numeric_indices == node_indices, "The node indexing is wrong."

The name of my node is a string type, I do not know whether this is the reason?
If I use my own graph for community detection, should I rename my node name from '1' to 'max_number'?

Thanks a lot for your attention and help.

Ask question about model HOPE

This is part of code of HOPE

    def _do_rescaled_decomposition(self, S):
        """
        Decomposing the similarity matrix.
        """
        U, sigmas, Vt = sps.linalg.svds(S, k=int(self.dimensions/2))
        sigmas = np.diagflat(np.sqrt(sigmas))
        self._left_embedding = np.dot(U, sigmas)
        self._right_embedding = np.dot(Vt.T, sigmas)

    def fit(self, graph: nx.classes.graph.Graph):
        """
        Fitting a HOPE model.

        Arg types:
            * **graph** *(NetworkX graph)* - The graph to be embedded.
        """
        self._set_seed()
        self._check_graph(graph)
        S = self._create_target(graph)
        self._do_rescaled_decomposition(S)

It uses the svd, but in paper (https://www.kdd.org/kdd2016/papers/files/rfp0184-ouA.pdf), it uses approximation of high-order proximity which could improve the efficiency to some extent. The algorithm as follows,
image

Implementation of SF

Hello,

I have been using the implementation of SF (details found in this paper: https://arxiv.org/pdf/1810.09155.pdf), and I am wondering if there is an inaccuracy.

I can create a graph like:

mat = [[0,1,1,0],[1,0,0,0],[1,0,0,1],[0,0,1,0]]
mat = np.matrix(mat)
G = nx.from_numpy_matrix(mat)

I've pulled the code out of the class, and into a separate function, but it follows the same logic:

def embedder(G, k):
    number_of_nodes = G.number_of_nodes()
    lap = nx.normalized_laplacian_matrix(G, nodelist=range(number_of_nodes))
    if number_of_nodes <= k:
        embedding = eigsh(lap, k=number_of_nodes-1, which='LM',
                          ncv=10*k, return_eigenvectors=False)
        shape_diff = k - embedding.shape[0] - 1
        embedding = np.pad(embedding, (1, shape_diff), 'constant', constant_values=0)
    else:
        embedding = eigsh(lap, k=k, which='LM',
                                  ncv=10*k, return_eigenvectors=False)
    return embedding

If we now test it with:

embedder(G, len(G))

we get the result:

array([0. , 0.5, 1.5, 2. ])

And then if we test:

embedder(G, 2)

we get the answer:

array([1.5, 2. ])

I get similar answers using the class itself. However, the paper states "We use the k smallest positive eigenvalues of L in ascending order as input of the classifier". For our first result, shouldn't the leading zero not be included (only positive eigenvalues). For our second we have accidentally chosen the largest two values - not the smallest.

A solution may be:

def embedder(G, k):
    number_of_nodes = G.number_of_nodes()
    lap = nx.normalized_laplacian_matrix(G, nodelist=range(number_of_nodes))

    embedding = eigsh(lap, k=number_of_nodes-1, which='LM',
                      ncv=10*number_of_nodes, 
                      return_eigenvectors=False)
    
    embedding = embedding[embedding > 0]
    
    if len(embedding) < k:
        # since nodes < k, pad out the embedding
        shape_diff = k - embedding.shape[0]
        embedding = np.pad(embedding, (0, shape_diff), 'constant', constant_values=0)

    if len(embedding) > k:
        embedding = embedding[:k]
        
    return embedding

Now, the two answers we get are array([0.5, 1.5, 2. , 0. ]) and array([0.5, 1.5 ]). I believe this is more in line with the methodology the paper suggested.

A question for BANE, TENE and TADW

Hi,

I am trying to use there 3 methods, but my graph is not connected:

site-packages/karateclub/estimator.py", line 50, in _check_connectivity
raise ValueError("Graph is not connected. Please see requirements.")

Could I know if it is necessary to make graph connected? and how to avoid this problem?

Thanks

Graphwave error

AttributeError: 'GraphWave' object has no attribute 'eigen_vectors'

HOPE for directed graphs?

The HOPE paper makes it seem like HOPE is intended to embed nodes in a directed graph.

However, it seems that the HOPE implementation in karateclub only supports undirected graphs.

I also noticed that directed graphs cause an assertion violation when Estimator._check_graph is called.

Does the HOPE implementation only support undirected graphs? This seems unintuitive to me since my understanding of the HOPE paper was that one of the main benefits of HOPE was that it could embed nodes in a directed graph.

Also, instead of having an exception be raised when a directed graph is given, can the directedness check happen first in Estimator._check_graph to avoid this?

graph2vec implementation and graphs with missing nodes

Hi there,

first of all, thanks a lot for developing this, it has potential to simplify in-silico experiments on biological networks and I am grateful for that!

I have a question related to the graph2vec implementation. The requirement of the package for graph notation is that nodes have to be named with integers starting from 0 and have to be consecutive. I am working with a collection of 9.000 small networks and would like to embed all of them into an N-dimensional space. Now, all those networks consist of about 25.000 nodes but in some networks these nodes (here it's really genes) are missing (not all genes are supposed to be present in all networks).

If I rename all my nodes from actual gene names to integers and know that some networks don't have all the genes, I will end up with some networks without consecutive node names, e.g. there will be (..), 20, 21, 24, 25, (...) in one network and perhaps (...), 20, 21, 22, 24, 25, (...) in another. That would violate the requirement of being consecutive.

My question is: is the implementation aware that a node 25 is the same object between the different networks? Or is it not important and in reality the embedding only takes into account the structure only and I should 'rename' all my networks separately to keep the node naming consecutive?

weighted graphs

Great package. Makes it very easy to experiment with newest algorithms.
I have a question though: did I understand it correctly that it does not support weighted graphs ?

Wrong requirement sklearn

Hi, it looks like this library is using the wrong requirement sklearn. It should be scikit-learn.

The requirement is here:

karateclub/setup.py

Lines 3 to 4 in 852000a

install_requires = ["numpy", "networkx", "tqdm", "python-louvain", "sklearn",
"scipy","pygsp", "gensim", "pandas", "six"]

If you visit sklearn PyPI page, you'll see it says to use scikit-learn instead: https://pypi.org/project/sklearn/

Bug in treesfeatures.py

Hello,

I noticed an error in the WeisfeilerLehmanHashing class from utils.treefeatures.py which is used by Graph2Vec and Gl2vec for building the list of nodes rooted subtrees of the each graph up to the number of WL iterations (argument: wl_iterations).

I implemented a correction of WeisfeilerLehmanHashing and show you hereafter a comparison to demonstrate the problem.

reproducible code

It
(i) generates a simple chain graph
(ii) Run the WL algorithm implemented in your original code and print the WL features hash codes outputs.
(iii) Run the corrected algorithm and print the WL features hash codes outputs.
(iv) Print the WL nodes rooted trees associated with the WL features hash codes (they are in the same order).

import numpy as np
import networkx as nx
from karateclub.utils.treefeatures import WeisfeilerLehmanHashing
import hashlib

# We build a directed attributed graph (4 nodes chain)
A = np.matrix([[0,1,0,0],[0,0,1,0],[0,0,0,1],[0,0,0,0]])
G = nx.DiGraph(incoming_graph_data= A ) 
nx.set_node_attributes(G, {0:1,1:2,2:3,3:4}, name= 'feature')

print('Run original version')
WL = WeisfeilerLehmanHashing(graph= G, wl_iterations= 3, attributed=True)

print('Number of final WL features')
print(len(WL.get_graph_features()))
print('Final WL features')
print(WL.get_graph_features())

class CorrectedWeisfeilerLehmanHashing(object):
	"""
	Weisfeiler-Lehman feature extractor class.
	
	Args:
		graph (NetworkX graph): NetworkX graph for which we do WL hashing.
		features (dict of strings): Feature hash map.
		iterations (int): Number of WL iterations.
	"""
	def __init__(self, graph, wl_iterations, attributed):
		"""
		Initialization method which also executes feature extraction.
		"""
		self.wl_iterations = wl_iterations
		self.graph = graph
		self.attributed = attributed
		
		#____ ADDED (For visualising trees only)
		self.subtrees = []
		self.all_subtrees = {}
		#_______
		
		self._set_features()
		self._do_recursions()
		
	def _set_features(self):
		"""
		Creating the features.
		"""
		if self.attributed:
			self.features = nx.get_node_attributes(self.graph, 'feature')
		else:
			self.features = {node: self.graph.degree(node) for node in self.graph.nodes()}
		#____ ADDED
		self.extracted_features = {k: [str(v)] for k, v in self.features.items()}		
		
		#(For visualising trees only)
		if self.attributed:
			self.subtrees = nx.get_node_attributes(self.graph, 'feature')
		else:
			self.subtrees = {node: self.graph.degree(node) for node in self.graph.nodes()}
		self.all_subtrees = {k:[v] for k,v in self.subtrees.items()}
		#____
	
	def _do_a_recursion(self):
		"""
		The method does a single WL recursion.
		Return types:
			* **new_features** *(dict of strings)* - The hash table with extracted WL features.
		"""
		#____ DELETED
		#self.extracted_features = {k: [str(v)] for k, v in self.features.items()}
		#____
		
		#____ ADDED (For visualising trees only)
		new_trees = {}
		#_______
		
		new_features = {}
		for node in self.graph.nodes():
			nebs = self.graph.neighbors(node)
			degs = [self.features[neb] for neb in nebs]
			features = [str(self.features[node])]+sorted([str(deg) for deg in degs])
			features = "_".join(features)
			hash_object = hashlib.md5(features.encode())
			hashing = hash_object.hexdigest()
			new_features[node] = hashing
				
			#____ ADDED (For visualising trees only)
			nebs = self.graph.neighbors(node)
			neigbor_trees = [self.subtrees[neb] for neb in nebs]
			ordered_neigbor_trees = [str(self.subtrees[node])]+sorted([str(tree) for tree in neigbor_trees])
			new_node_rooted_tree = "("+"_".join(ordered_neigbor_trees)+")"
			new_trees[node] = new_node_rooted_tree
			#_______				
			
		self.extracted_features = {k: self.extracted_features[k] + [v] for k, v in new_features.items()} 
		
		#____ ADDED (For visualising trees only)
		self.all_subtrees = {k : self.all_subtrees[k] + [v] for k,v in new_trees.items()}
		self.subtrees = new_trees
		#____
		
		#____ ADDED
		# we remove the initial non encoded feature for each node if still there
		for k,v in new_features.items():
			if len(self.extracted_features[k][0])!=32:
				del self.extracted_features[k][0:1]
		#____
		
		return new_features
	
	def _do_recursions(self):
		"""
		The method does a series of WL recursions.
		"""
		for _ in range(self.wl_iterations):
			self.features = self._do_a_recursion()
		
		#____ ADDED (For visualising trees only)
		for k,v in self.all_subtrees.items():
			del self.all_subtrees[k][0:1]
		#____
		
	def get_node_features(self):
		"""
		Return the node level features.
		"""
		return self.extracted_features
		
	def get_graph_features(self):
		"""
		Return the graph level features.
		"""
		return [feature for node, features in self.extracted_features.items() for feature in features]
	
	#____ ADDED (For visualising trees only)
	def get_subtrees(self):
		"""
		Return the nodes rooted subtrees of all WL iterations
		"""
		return self.all_subtrees

print('Run corrected version')

WL = CorrectedWeisfeilerLehmanHashing(graph= G, wl_iterations= 3, attributed=True)

print('Number of final WL features')
print(len(WL.get_graph_features()))
print('Final WL features')
print(WL.get_graph_features())

print('Corresponding nodes rooted subtrees')
print(WL.all_subtrees)

Here is the output of this script:
output

Explanation

Indeed, you can see first that the original code only produces 8 (4x2) WL features hash codes instead of 12 (4x3) in the corrected version. The algorithm must produce one WL feature per node and iteration.

You can also see that 8 codes among the corrected output match the non-corrected output. Now looking at the output of WL.all_subtrees, whose order match the features in WL.get_graph_features(), you can see that the codes that were lacking in the non-corrected version correspond to the first WL iteration.

Summary

The problem of the original code is that it only keep the WL features of the 2 last iterations. I hope that this is clear enough. Of course you can use directly the corrected version. I kept track of the changes and highlighted those that I only use for the trees strings construction.

Could you please apply the changes to the package? I'm using it for research and I think it would be much easier for everyone that I directly refer to your package rather than adding a corrected code in a personal repo.

Anyway, thank you for this useful package =)

GEMSEC example

Hi --

I'm trying to use your GEMSEC implementation, but having a little bit of trouble figuring out how to run it on an example dataset. Are you able to share an example of training and evaluating the model on one of the datasets here?

Thanks!
~ Ben

Multithreading for WL hasing function

Hi!

Maybe just another suggestion. In the embedding algorithms, the WeisfeilerLehmanHashing function in the fit function could be time-consuming and the WL hashing function for each graph is independent. Therefore, maybe using multhreading from python can speed them up and I modify the code for my application of graph2vec:

==================================

def fit(self, graphs):
    """
    Fitting a Graph2Vec model.

    Arg types:
        * **graphs** *(List of NetworkX graphs)* - The graphs to be embedded.
    """
    pool = ThreadPool(8)
    args_generator = [(graph, self.wl_iterations, self.attributed) for graph in graphs]
    documents = pool.starmap(WeisfeilerLehmanHashing, args_generator)
    pool.close()
    pool.join()
    #documents = [WeisfeilerLehmanHashing(graph, self.wl_iterations, self.attributed) for graph in graphs]
    documents = [TaggedDocument(words=doc.get_graph_features(), tags=[str(i)]) for i, doc in enumerate(documents)]

    model = Doc2Vec(documents,
                    vector_size=self.dimensions,
                    window=0,
                    min_count=self.min_count,
                    dm=0,
                    sample=self.down_sampling,
                    workers=self.workers,
                    epochs=self.epochs,
                    alpha=self.learning_rate,
                    seed=self.seed)

    self._embedding = [model.docvecs[str(i)] for i, _ in enumerate(documents)]

Using Graph2Vec with attributes raising an error

When using
g2v_object = Graph2Vec(attributed=True)
an error raised in fit:

~/.local/lib/python3.6/site-packages/karateclub/utils/treefeatures.py in _set_features(self)
26 """
27 if self.attributed:
---> 28 self.features = nx.get_node_attributes(G, 'feature')
29 else:
30 self.features = {node: self.graph.degree(node) for node in self.graph.nodes()}
NameError: name 'G' is not defined

Probably in

self.features = nx.get_node_attributes(G, 'feature')

should change G to self.graph

Graph Embeddings using node features and inductivity

Hello,

First of all thank you for this amazing library! I have a serie of small graphs where each node contains features and I am trying to learn graph-level embedding in an unsupervised manner.
However, I couldn't find how to load node features in the graphs before feeding them to a graph embedding algorithm. Could you describe the input needed by the algorithms ?

Also, is it possible to generate embedding with some sort of forward function once the models are trained (without retraining the model) ? I.e. does the library support inductivity ?

Thank you!

Potential Problem of function _setup_target_matrices in danmf.py

Hi Benedek,

According to my application, some datasets, in which the node indexs might not be continuous, will lead to wrong returned result of function _setup_target_matrices, caused by the expression " nodelist=range(self.graph.number_of_nodes())". To reproduce the problem, you can try a graph generated from: G.add_edges_from([(0,1),(7,8)])
Maybe this is just what I encountered. I walk around the problem by giving the nodes new index.

Best regards,

Tingyuan

.

def _setup_target_matrices(self, graph):
    """
    Setup target matrix for pre-training process.
    Arg types:
        * **graph** *(NetworkX graph)* - The graph being clustered.
    """
    self.graph = graph
    self.A = nx.adjacency_matrix(self.graph, nodelist=range(self.graph.number_of_nodes()))
    self.L = nx.laplacian_matrix(self.graph, nodelist=range(self.graph.number_of_nodes()))
    self.D = self.L+self.A

A mistake in MNMF

In your code, self._ B2 is created as the modularity matrix, and it is corresponding to $\mathbf{B}_ {1}$ in the original MNMF paper. However, in the paper, $\mathbf{B}_ {1}$ is not the modularity matrix, and it is defined as
image
So the code

def _modularity_generator(self):
        """Calculating the sparse modularity matrix."""
        degs = nx.degree(self._graph)
        e_count = self._graph.number_of_edges()
        n_count = self._graph.number_of_nodes()
        modularity_mat_shape = (n_count, n_count)
        indices_1 = np.array([edge[0] for edge in self._graph.edges()] + [edge[1] for edge in self._graph.edges()])
        indices_2 = np.array([edge[1] for edge in self._graph.edges()] + [edge[0] for edge in self._graph.edges()])
        scores = [1.0-(float(degs[e[0]]*degs[e[1]])/(2*e_count)) for e in self._graph.edges()]
        scores = scores + [1.0-(float(degs[e[1]]*degs[e[0]])/(2*e_count)) for e in self._graph.edges()]
        mod_matrix = coo_matrix((scores, (indices_1, indices_2)), shape=modularity_mat_shape)
        return mod_matrix

should be revised to

def _modularity_generator(self):
        """Calculating the sparse modularity matrix."""
        degs = nx.degree(self._graph)
        e_count = self._graph.number_of_edges()
        n_count = self._graph.number_of_nodes()
        modularity_mat_shape = (n_count, n_count)
        indices_1 = np.array([edge[0] for edge in self._graph.edges()] + [edge[1] for edge in self._graph.edges()])
        indices_2 = np.array([edge[1] for edge in self._graph.edges()] + [edge[0] for edge in self._graph.edges()])
        scores = [float(degs[e[0]]*degs[e[1]])/(2*e_count) for e in self._graph.edges()]
        scores = scores + [float(degs[e[1]]*degs[e[0]])/(2*e_count) for e in self._graph.edges()]
        mod_matrix = coo_matrix((scores, (indices_1, indices_2)), shape=modularity_mat_shape)
        return mod_matrix

I've opened a PR about this problem.

Inference time

Hi, for attribute based approaches, is there a way to get embeddings at test time on unseen data ?
For example: after embedder.fit(graph, X) , can i input for example embbeder.predict([some unseen input features]) to get embeddings at test time ?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.