Hello,
I noticed an error in the WeisfeilerLehmanHashing class from utils.treefeatures.py, which is used by Graph2Vec and GL2Vec to build the list of node-rooted subtrees of each graph up to the number of WL iterations (argument: wl_iterations).
I implemented a correction of WeisfeilerLehmanHashing and show a comparison below to demonstrate the problem.
**Reproducible code**

The script below:
(i) generates a simple chain graph;
(ii) runs the WL algorithm as implemented in your original code and prints the resulting WL feature hash codes;
(iii) runs the corrected algorithm and prints its WL feature hash codes;
(iv) prints the WL node-rooted trees associated with the WL feature hash codes (they are in the same order).
```python
import numpy as np
import networkx as nx
from karateclub.utils.treefeatures import WeisfeilerLehmanHashing
import hashlib

# We build a directed attributed graph (4-node chain)
A = np.matrix([[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]])
G = nx.DiGraph(incoming_graph_data=A)
nx.set_node_attributes(G, {0: 1, 1: 2, 2: 3, 3: 4}, name='feature')

print('Run original version')
WL = WeisfeilerLehmanHashing(graph=G, wl_iterations=3, attributed=True)
print('Number of final WL features')
print(len(WL.get_graph_features()))
print('Final WL features')
print(WL.get_graph_features())
```
```python
class CorrectedWeisfeilerLehmanHashing(object):
    """
    Weisfeiler-Lehman feature extractor class.

    Args:
        graph (NetworkX graph): NetworkX graph for which we do WL hashing.
        wl_iterations (int): Number of WL iterations.
        attributed (bool): Use the 'feature' node attribute as initial labels
            instead of node degrees.
    """
    def __init__(self, graph, wl_iterations, attributed):
        """
        Initialization method which also executes feature extraction.
        """
        self.wl_iterations = wl_iterations
        self.graph = graph
        self.attributed = attributed
        # ____ ADDED (for visualising trees only)
        self.subtrees = []
        self.all_subtrees = {}
        # _______
        self._set_features()
        self._do_recursions()

    def _set_features(self):
        """
        Creating the features.
        """
        if self.attributed:
            self.features = nx.get_node_attributes(self.graph, 'feature')
        else:
            self.features = {node: self.graph.degree(node) for node in self.graph.nodes()}
        # ____ ADDED
        self.extracted_features = {k: [str(v)] for k, v in self.features.items()}
        # (for visualising trees only)
        if self.attributed:
            self.subtrees = nx.get_node_attributes(self.graph, 'feature')
        else:
            self.subtrees = {node: self.graph.degree(node) for node in self.graph.nodes()}
        self.all_subtrees = {k: [v] for k, v in self.subtrees.items()}
        # ____

    def _do_a_recursion(self):
        """
        The method does a single WL recursion.

        Return types:
            * **new_features** *(dict of strings)* - The hash table with extracted WL features.
        """
        # ____ DELETED
        # self.extracted_features = {k: [str(v)] for k, v in self.features.items()}
        # ____
        # ____ ADDED (for visualising trees only)
        new_trees = {}
        # _______
        new_features = {}
        for node in self.graph.nodes():
            nebs = self.graph.neighbors(node)
            degs = [self.features[neb] for neb in nebs]
            features = [str(self.features[node])] + sorted([str(deg) for deg in degs])
            features = "_".join(features)
            hash_object = hashlib.md5(features.encode())
            hashing = hash_object.hexdigest()
            new_features[node] = hashing
            # ____ ADDED (for visualising trees only)
            nebs = self.graph.neighbors(node)
            neighbor_trees = [self.subtrees[neb] for neb in nebs]
            ordered_neighbor_trees = [str(self.subtrees[node])] + sorted([str(tree) for tree in neighbor_trees])
            new_node_rooted_tree = "(" + "_".join(ordered_neighbor_trees) + ")"
            new_trees[node] = new_node_rooted_tree
            # _______
        self.extracted_features = {k: self.extracted_features[k] + [v] for k, v in new_features.items()}
        # ____ ADDED (for visualising trees only)
        self.all_subtrees = {k: self.all_subtrees[k] + [v] for k, v in new_trees.items()}
        self.subtrees = new_trees
        # ____
        # ____ ADDED
        # Remove the initial non-encoded feature for each node if it is still there
        for k, v in new_features.items():
            if len(self.extracted_features[k][0]) != 32:
                del self.extracted_features[k][0:1]
        # ____
        return new_features

    def _do_recursions(self):
        """
        The method does a series of WL recursions.
        """
        for _ in range(self.wl_iterations):
            self.features = self._do_a_recursion()
        # ____ ADDED (for visualising trees only)
        for k, v in self.all_subtrees.items():
            del self.all_subtrees[k][0:1]
        # ____

    def get_node_features(self):
        """
        Return the node-level features.
        """
        return self.extracted_features

    def get_graph_features(self):
        """
        Return the graph-level features.
        """
        return [feature for node, features in self.extracted_features.items() for feature in features]

    # ____ ADDED (for visualising trees only)
    def get_subtrees(self):
        """
        Return the node-rooted subtrees of all WL iterations.
        """
        return self.all_subtrees


print('Run corrected version')
WL = CorrectedWeisfeilerLehmanHashing(graph=G, wl_iterations=3, attributed=True)
print('Number of final WL features')
print(len(WL.get_graph_features()))
print('Final WL features')
print(WL.get_graph_features())
print('Corresponding node-rooted subtrees')
print(WL.all_subtrees)
```
Here is the output of this script:
**Explanation**

You can see first that the original code only produces 8 (4x2) WL feature hash codes instead of the 12 (4x3) produced by the corrected version; the algorithm must produce one WL feature per node and per iteration.
You can also see that 8 codes in the corrected output match the non-corrected output. Now looking at the output of WL.all_subtrees, whose order matches the features in WL.get_graph_features(), you can see that the codes missing from the non-corrected version are exactly those of the first WL iteration.
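To isolate the mechanism behind this count difference, here is a minimal standalone sketch (not the karateclub code; the function name, adjacency list, and labels are illustrative) contrasting the buggy pattern of resetting the per-node feature lists inside each recursion with the corrected accumulating pattern:

```python
import hashlib

def wl_feature_count(adj, labels, iterations, reset_each_round):
    """Count WL hash features kept after `iterations` rounds."""
    n = len(adj)
    extracted = {v: [] for v in range(n)}
    feats = {v: str(labels[v]) for v in range(n)}
    for _ in range(iterations):
        if reset_each_round:
            # Mimics the original bug: results of earlier rounds are thrown away.
            extracted = {v: [feats[v]] for v in range(n)}
        new = {}
        for v in range(n):
            # Concatenate own label with sorted successor labels, then hash
            nbrs = sorted(feats[u] for u in range(n) if adj[v][u])
            combined = "_".join([feats[v]] + nbrs)
            new[v] = hashlib.md5(combined.encode()).hexdigest()
            extracted[v].append(new[v])
        feats = new
    # Keep only the 32-character md5 digests, as in the corrected class
    return sum(len(f) == 32 for v in range(n) for f in extracted[v])

chain = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
print(wl_feature_count(chain, [1, 2, 3, 4], 3, reset_each_round=True))   # 8: only the last 2 rounds survive
print(wl_feature_count(chain, [1, 2, 3, 4], 3, reset_each_round=False))  # 12: one feature per node per round
```

With the reset, each round starts the lists over from the previous round's hashes, so after 3 iterations only two entries per node remain, matching the 8-vs-12 discrepancy shown above.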
**Summary**

The problem with the original code is that it only keeps the WL features of the last 2 iterations. I hope this is clear enough. Of course, you can use the corrected version directly; I kept track of the changes and highlighted those that are only used to build the tree strings.
Could you please apply the changes to the package? I'm using it for research, and I think it would be much easier for everyone if I could refer directly to your package rather than adding corrected code in a personal repo.
Anyway, thank you for this useful package =)