benedekrozemberczki / graph2vec

A parallel implementation of "graph2vec: Learning Distributed Representations of Graphs" (MLGWorkshop 2017).

Home Page: https://karateclub.readthedocs.io/

License: GNU General Public License v3.0

Python 100.00%


graph2vec's Introduction

Benedek A. Rozemberczki


graph2vec's Issues

error while running main

I installed all the dependencies as specified in the README and ran the example as given in the repo. I get the following error:

0it [00:00, ?it/s]Feature extraction started.



/datadisk/Workspace/test/graph2vec/venv/lib/python3.6/site-packages/gensim/models/doc2vec.py:566: UserWarning: The parameter `iter` is deprecated, will be removed in 4.0.0, use `epochs` instead.
Optimization started.

  warnings.warn("The parameter `iter` is deprecated, will be removed in 4.0.0, use `epochs` instead.")
/datadisk/Workspace/test/graph2vec/venv/lib/python3.6/site-packages/gensim/models/doc2vec.py:570: UserWarning: The parameter `size` is deprecated, will be removed in 4.0.0, use `vector_size` instead.
  warnings.warn("The parameter `size` is deprecated, will be removed in 4.0.0, use `vector_size` instead.")
2018-11-10 21:59:24,869 : INFO : collecting all words and their counts
2018-11-10 21:59:24,869 : INFO : collected 0 word types and 0 unique tags from a corpus of 0 examples and 0 words
2018-11-10 21:59:24,869 : INFO : Loading a fresh vocabulary
2018-11-10 21:59:24,869 : INFO : effective_min_count=5 retains 0 unique words (0% of original 0, drops 0)
2018-11-10 21:59:24,869 : INFO : effective_min_count=5 leaves 0 word corpus (0% of original 0, drops 0)
2018-11-10 21:59:24,869 : INFO : deleting the raw counts dictionary of 0 items
2018-11-10 21:59:24,869 : INFO : sample=0.0001 downsamples 0 most-common words
2018-11-10 21:59:24,869 : INFO : downsampling leaves estimated 0 word corpus (0.0% of prior 0)
2018-11-10 21:59:24,869 : INFO : estimated required memory for 0 words and 128 dimensions: 0 bytes
2018-11-10 21:59:24,869 : INFO : resetting layer weights
Traceback (most recent call last):
  File "/datadisk/Workspace/test/graph2vec/src/graph2vec.py", line 131, in <module>
    main(args)
  File "/datadisk/Workspace/test/graph2vec/src/graph2vec.py", line 125, in main
    alpha = args.learning_rate)
  File "/datadisk/Workspace/test/graph2vec/venv/lib/python3.6/site-packages/gensim/models/doc2vec.py", line 615, in __init__
    end_alpha=self.min_alpha, callbacks=callbacks)
  File "/datadisk/Workspace/test/graph2vec/venv/lib/python3.6/site-packages/gensim/models/doc2vec.py", line 795, in train
    queue_factor=queue_factor, report_delay=report_delay, callbacks=callbacks, **kwargs)
  File "/datadisk/Workspace/test/graph2vec/venv/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 1081, in train
    **kwargs)
  File "/datadisk/Workspace/test/graph2vec/venv/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 536, in train
    total_words=total_words, **kwargs)
  File "/datadisk/Workspace/test/graph2vec/venv/lib/python3.6/site-packages/gensim/models/base_any2vec.py", line 1187, in _check_training_sanity
    raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model

Process finished with exit code 1

Any suggestion what could be going wrong?

Visualisation of graph2vec embeddings in a network

Hello,

Is it possible to see the network resulting from the DeepWalk embedding of a graph2vec dataset?

The aim is to bring together the nodes that are most connected in order to ease their visualisation.

Many thanks in advance,

Nicolas

ValueError: 11 columns passed, passed data had 14 columns

Hello, I am trying to run graph2vec on my own dataset. It seems like each graph will use different values of dimensions, as I encountered the following error during my run:

Traceback (most recent call last):
  File "graph2vec.py", line 137, in <module>
    main(args)
  File "graph2vec.py", line 133, in main
    save_embedding(args.output_path, model, graphs, args.dimensions)
  File "graph2vec.py", line 106, in save_embedding
    out = pd.DataFrame(out, columns=column_names)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 474, in __init__
    arrays, columns = to_arrays(data, columns, dtype=dtype)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 461, in to_arrays
    return _list_to_arrays(data, columns, coerce_float=coerce_float, dtype=dtype)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 500, in _list_to_arrays
    raise ValueError(e) from e
ValueError: 11 columns passed, passed data had 14 columns

Could you please suggest how to solve this problem?

node and edge attributes

How do we account for node and edge attributes? I read your answers where you mentioned that we have to use a hash function, but I don't know how to do so. Could you please help?
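
One way to read the "hash function" suggestion, offered as a sketch and not as the maintainer's method: collapse each node's attributes into a single string label before writing the "features" dictionary, so that every node still carries exactly one label. The helper name `combine_attributes` and the `|` separator are assumptions made for this illustration.

```python
import hashlib


def combine_attributes(values):
    """Collapse several attribute values into one hashed string label."""
    # Join the values in a fixed order so equal attribute lists give equal labels.
    joined = "|".join(str(value) for value in values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Hypothetical graph with three attributes per node.
raw_features = {"0": [7, 3, 4], "1": [8, 3, 5], "2": [9, 2, 7]}
features = {node: combine_attributes(values) for node, values in raw_features.items()}
graph = {"edges": [[0, 1], [0, 2]], "features": features}
```

Edge attributes could be handled in the same spirit by folding the attributes of a node's incident edges into its label, although that is a modelling choice and nothing on this page confirms it. Hashing distinct continuous values gives distinct labels, so binning such values first may be worth considering.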

[Question] Add PyPi package

Hi, I find this package very interesting.
I was wondering if this could be added as a package in PyPI so that it can be easily installed/used inside larger projects.

Greetings

multiple features

If each node has many features instead of one, for example {"edges": [[0, 1], [0, 2]], "features": {"0": "7,3,4", "1": "8,3,5", "2": "9,2,7"}}, how can I represent the dataset as input? Thank you.

Question about embeddings

Hello,
My goal is to use this model to obtain graph embeddings and use them to create a training set for another machine learning model. Is this correct? However, after training graph2vec on my custom dataset, how do I get the embedding of a previously unseen graph, i.e. a graph that is not in the training set?

Questions for some use cases

Hi, thanks for the great code.

Let me ask the following.
I want to use multi-dimensional vectors for the features.
For example, I want to provide input like this:
{"edges":[[0, 1]], "features": {"0": ["0.1", "0.1"], "1": ["0.0", "0.1"]}}
Can we do it?

One more thing.
Can we save the model and use it on other input data that was not used for training, and output the result CSV for it?

Best regards.

ValueError: invalid literal for int() with base 10: 'dataset\\0'

I cloned the repository and ran it as-is with the defaults and the given sample data. Feature extraction completes fine, but I then get the following error when the optimization step starts:

Optimization started.

Traceback (most recent call last):
  File "src/graph2vec.py", line 129, in <module>
    main(args)
  File "src/graph2vec.py", line 125, in main
    save_embedding(args.output_path, model, graphs, args.dimensions)
  File "src/graph2vec.py", line 98, in save_embedding
    out.append([int(identifier)] + list(model.docvecs["g_"+identifier]))
ValueError: invalid literal for int() with base 10: 'dataset\\0'

Perhaps passing 'dataset\0' as identifier rather than 0?
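
A possible explanation, assuming the reader derives the graph name with `path.strip(".json").split("/")[-1]` as in the `dataset_reader` quoted in a later issue on this page: on Windows, glob returns backslash-separated paths, so splitting on "/" leaves `dataset\0` instead of `0`. The sketch below shows a separator-independent way to get the name; `graph_name` is a hypothetical helper, not part of the repository.

```python
from pathlib import PureWindowsPath


def graph_name(path):
    """Return the file name without its directory or extension.

    PureWindowsPath treats both "/" and "\\" as separators, so the same
    call handles paths produced on Windows and on POSIX systems.
    """
    return PureWindowsPath(path).stem


print(graph_name("dataset/0.json"))
print(graph_name("dataset\\0.json"))
```

Both calls give a name that `int()` can parse. The trade-off is that a POSIX file name containing a literal backslash would also be split, which seems acceptable for integer-named dataset files.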

Input JSON

I was having trouble running the script when the input dictionaries had no 'feature' key. Running the script would return this error:

'DegreeView' object has no attribute 'items'

I'm not the most experienced programmer so I could be wrong here but I think in networkx 2.4 networkx.Graph.degree returns an iterable and not a dict. So in line 72 of graph2vec.py, I had to consider the case when features was an iterable so I changed features.items() to list(features). And this worked for input dictionaries with no feature key so it now looks like:

if "features" in data.keys():
    features = data["features"]
    features = {int(k): v for k, v in features.items()}
else:
    features = nx.degree(graph)
    features = {int(k): v for k, v in list(features)}

Maybe I'm completely wrong but just in case anyone else had the same issue this might help since this worked for me.

graph2vec/src/graph2vec.py

Line 72 in a2001d0

features = {int(k): v for k, v in features.items()}

Error on executing graph2vec.py

After installing all required packages and executing graph2vec.py, I got the following error.
Any help is more than welcome.

python src/graph2vec.py
Traceback (most recent call last):
  File "src/graph2vec.py", line 10, in <module>
    from joblib import Parallel, delayed
  File "/home/eleni/Documents/PythonProjects/graph2vec/venv/lib/python3.8/site-packages/joblib/__init__.py", line 119, in <module>
    from .parallel import Parallel
  File "/home/eleni/Documents/PythonProjects/graph2vec/venv/lib/python3.8/site-packages/joblib/parallel.py", line 28, in <module>
    from ._parallel_backends import (FallbackToBackend, MultiprocessingBackend,
  File "/home/eleni/Documents/PythonProjects/graph2vec/venv/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 22, in <module>
    from .executor import get_memmapping_executor
  File "/home/eleni/Documents/PythonProjects/graph2vec/venv/lib/python3.8/site-packages/joblib/executor.py", line 14, in <module>
    from .externals.loky.reusable_executor import get_reusable_executor
  File "/home/eleni/Documents/PythonProjects/graph2vec/venv/lib/python3.8/site-packages/joblib/externals/loky/__init__.py", line 12, in <module>
    from .backend.reduction import set_loky_pickler
  File "/home/eleni/Documents/PythonProjects/graph2vec/venv/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 125, in <module>
    from joblib.externals import cloudpickle  # noqa: F401
  File "/home/eleni/Documents/PythonProjects/graph2vec/venv/lib/python3.8/site-packages/joblib/externals/cloudpickle/__init__.py", line 3, in <module>
    from .cloudpickle import *
  File "/home/eleni/Documents/PythonProjects/graph2vec/venv/lib/python3.8/site-packages/joblib/externals/cloudpickle/cloudpickle.py", line 167, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "/home/eleni/Documents/PythonProjects/graph2vec/venv/lib/python3.8/site-packages/joblib/externals/cloudpickle/cloudpickle.py", line 148, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)

model save and load

Hi
Could you tell us how to save and load the trained model?

Error for new dataset: RuntimeError: you must first build vocabulary before training the model

When changing the dataset to a new one in the format of the test dataset, I get the following error:
RuntimeError: you must first build vocabulary before training the model.

Where should I specify the vocabulary and where should it be located?

Graph2Vec datasets

Hi,

Graph2Vec has wide applications, but I can't find many graph datasets on which to run experiments and demonstrate its efficiency.

Do you have a list of applicable datasets?

Many thanks in advance,

Nicolas

Edge Features

Hi, I am planning on using graph2vec for a research project. Am I right that this implementation does not allow for edge features to be included e.g. edge weights?
Thanks.

worse results with latest version

Installing the latest version, I am getting worse results than with an earlier version for the same experiment.
The issue seems to be with line 104 of graph2vec.py:
out.append([identifier] + list(model.docvecs["g_"+identifier]))
changing to:
out.append([int(identifier)] + list(model.docvecs["g_"+identifier]))
solved the issue for me.

error with custom data

I converted my data to your dataset format, checked that the file names are integers, and validated all of the JSON files, but I still get this error.

Feature extraction started.

100%|█████████████████████████████████████████████████████████████████████████████████████████| 741/741 [00:00<00:00, 819.88it/s]
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/EUGENE/anaconda3/envs/thesis/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 418, in _process_worker
    r = call_item()
  File "/Users/EUGENE/anaconda3/envs/thesis/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 272, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/Users/EUGENE/anaconda3/envs/thesis/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 567, in __call__
    return self.func(*args, **kwargs)
  File "/Users/EUGENE/anaconda3/envs/thesis/lib/python3.7/site-packages/joblib/parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "/Users/EUGENE/anaconda3/envs/thesis/lib/python3.7/site-packages/joblib/parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "src/graph2vec.py", line 84, in feature_extractor
    graph, features, name = dataset_reader(path)
  File "src/graph2vec.py", line 67, in dataset_reader
    graph = nx.from_edgelist(data["edges"])
  File "/Users/EUGENE/anaconda3/envs/thesis/lib/python3.7/site-packages/networkx/convert.py", line 390, in from_edgelist
    G.add_edges_from(edgelist)
  File "/Users/EUGENE/anaconda3/envs/thesis/lib/python3.7/site-packages/networkx/classes/graph.py", line 969, in add_edges_from
    "Edge tuple %s must be a 2-tuple or 3-tuple." % (e,))
networkx.exception.NetworkXError: Edge tuple [] must be a 2-tuple or 3-tuple.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "src/graph2vec.py", line 131, in <module>
    main(args)
  File "src/graph2vec.py", line 115, in main
    document_collections = Parallel(n_jobs = args.workers)(delayed(feature_extractor)(g, args.wl_iterations) for g in tqdm(graphs))
  File "/Users/EUGENE/anaconda3/envs/thesis/lib/python3.7/site-packages/joblib/parallel.py", line 934, in __call__
    self.retrieve()
  File "/Users/EUGENE/anaconda3/envs/thesis/lib/python3.7/site-packages/joblib/parallel.py", line 833, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/Users/EUGENE/anaconda3/envs/thesis/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 521, in wrap_future_result
    return future.result(timeout=timeout)
  File "/Users/EUGENE/anaconda3/envs/thesis/lib/python3.7/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/Users/EUGENE/anaconda3/envs/thesis/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
networkx.exception.NetworkXError: Edge tuple [] must be a 2-tuple or 3-tuple.
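
The traceback above says that networkx received an edge entry `[]`, so at least one input file probably has an empty or malformed item in its "edges" list, and it is not necessarily the file attached to this issue. A minimal sketch for locating such entries before running graph2vec follows; `find_edge_problems` is a hypothetical helper, not part of the repository.

```python
def find_edge_problems(data):
    """Return a description for every entry in data["edges"] that
    networkx would reject (anything other than a 2- or 3-item list)."""
    problems = []
    for index, edge in enumerate(data.get("edges", [])):
        if not isinstance(edge, (list, tuple)) or len(edge) not in (2, 3):
            problems.append("edge %d is malformed: %r" % (index, edge))
    return problems


# Hypothetical graphs: the second one contains an empty edge entry.
good = {"edges": [[2, 3], [3, 4]], "features": {"2": "1", "3": "3", "4": "3"}}
bad = {"edges": [[2, 3], []], "features": {"2": "1", "3": "1"}}
print(find_edge_problems(good))
print(find_edge_problems(bad))
```

Running this over every file in the input folder (loading each with `json.load`) should point to the file that triggers the error.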

attaching one of my json files.

0101.json
{"edges": [[2, 3], [3, 4], [3, 5], [4, 8], [4, 10], [5, 6], [5, 16], [6, 22], [6, 114], [8, 9], [9, 17], [9, 18], [10, 20], [10, 61], [10, 21], [11, 12], [11, 13], [12, 66], [12, 78], [13, 14], [13, 15], [16, 24], [16, 25], [17, 26], [17, 28], [18, 19], [19, 29], [19, 30], [20, 93], [21, 33], [21, 35], [22, 23], [23, 36], [23, 37], [24, 38], [24, 40], [25, 43], [25, 41], [25, 44], [25, 45], [26, 27], [27, 46], [27, 47], [28, 51], [29, 52], [29, 53], [30, 54], [30, 92], [32, 58], [32, 64], [33, 34], [34, 68], [34, 79], [35, 76], [35, 80], [36, 11], [37, 11], [38, 39], [39, 82], [39, 87], [40, 115], [40, 83], [47, 48], [48, 49], [48, 50], [49, 25], [50, 81], [54, 55], [55, 94], [55, 98], [55, 107], [56, 57], [56, 60], [58, 59], [59, 62], [59, 63], [61, 104], [61, 106], [64, 65], [65, 102], [66, 70], [66, 74], [70, 71], [71, 72], [71, 103], [79, 86], [79, 88], [83, 85], [83, 84], [88, 89], [89, 90], [89, 91], [92, 56], [93, 31], [93, 32], [94, 96], [94, 100], [96, 97], [104, 105], [105, 107], [105, 109], [106, 111], [106, 113], [109, 110], [111, 112], [114, 61]], "features": {"2": "1", "3": "3", "4": "3", "5": "3", "8": "2", "10": "4", "6": "3", "16": "3", "22": "2", "114": "2", "9": "3", "17": "3", "18": "2", "20": "2", "61": "4", "21": "3", "11": "4", "12": "3", "13": "3", "66": "3", "78": "1", "14": "1", "15": "1", "24": "3", "25": "6", "26": "2", "28": "2", "19": "3", "29": "3", "30": "3", "93": "3", "33": "2", "35": "3", "23": "3", "36": "2", "37": "2", "38": "2", "40": "3", "43": "1", "41": "1", "44": "1", "45": "1", "27": "3", "46": "1", "47": "2", "51": "1", "52": "1", "53": "1", "54": "2", "92": "2", "32": "3", "58": "2", "64": "2", "34": "3", "68": "1", "79": "3", "76": "1", "80": "1", "39": "3", "82": "1", "87": "1", "115": "1", "83": "3", "48": "3", "49": "2", "50": "2", "81": "1", "55": "4", "94": "3", "98": "1", "107": "2", "56": "3", "57": "1", "60": "1", "59": "3", "62": "1", "63": "1", "104": "2", "106": "3", "65": "2", "102": "1", "70": "2", "74": 
"1", "71": "3", "72": "1", "103": "1", "86": "1", "88": "2", "85": "1", "84": "1", "89": "3", "90": "1", "91": "1", "31": "1", "96": "2", "100": "1", "97": "1", "105": "3", "109": "2", "111": "2", "113": "1", "110": "1", "112": "1"}}

Using one example

If I use graph2vec on a dataset is there any way I can save the model and use it on one example?

Correct Multiset Calculation

Thanks for your work on this library. I was looking through the code after I applied this to my own dataset and noticed a potential issue with the calculation of the label multiset in the WeisfeilerLehmanMachine.

In the original paper they provide this figure to explain one iteration of the relabeling process (p. 10):

In particular, the node with label 4 in the left graph has a pre-hashed label of 4,1135, meaning its original label is 4 and the multiset representation of its neighbors' labels is 1135.

I think there's an issue in how this multiset is calculated in the WeisfeilerLehmanMachine class, specifically this line: https://github.com/benedekrozemberczki/graph2vec/blob/master/src/graph2vec.py#L42

I've got a simple reproduction of this logic here. I am expecting the output to be 4_1_1_3_5, however, I get:

>>> node_label = "4"
>>> neighbor_labels = ["5", "3", "1", "1"]
>>> features = "_".join([node_label] + list(set(sorted([str(n) for n in neighbor_labels]))))
>>> print(features)
4_5_1_3

I believe you can fix this by removing the creation of the set, I'm not sure that's doing what we want. Since sets are unordered collections of objects, we are effectively ordering the objects with sorted and then unordering them with set.

I could be wrong, as I didn't evaluate the full pipeline here. There may be some other change in the input to this function that does give you the correct answer, but I think this part of the implementation is worth another look.
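
The change proposed above can be sketched as follows. This illustrates the reporter's suggestion (sort the neighbor labels, keep duplicates) and is not the repository's code; `wl_feature_string` is a name chosen for this example.

```python
def wl_feature_string(node_label, neighbor_labels):
    """Build the pre-hash label: own label followed by the sorted
    multiset of neighbor labels, with duplicates kept."""
    # sorted() returns a list, so repeated labels survive and the order is stable.
    ordered = sorted(str(label) for label in neighbor_labels)
    return "_".join([str(node_label)] + ordered)


print(wl_feature_string("4", ["5", "3", "1", "1"]))  # 4_1_1_3_5
```

Because the labels are compared as strings, "10" sorts before "2". That does not matter for relabeling as long as the order is applied consistently to every node.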

Different number node and edge

Hi,
I'm facing a problem related to the number of nodes and edges: I have 7 nodes but only 6 edges. I get the error: joblib.externals.loky.process_executor._RemoteTraceback: self.do_recursions(). How can I get graph2vec to work in this case?

Graph2vec for graph similarity learning

I am thinking of using Graph2vec for graph similarity learning.

Given two graphs, I am thinking of computing their embeddings and then taking the cosine similarity of the two embeddings.

My graphs would have around 5000 nodes and 4000 edges.

Is Graph2vec a good fit for this task?
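
The comparison step described above can be sketched in plain Python, assuming each embedding is available as a list of floats (for example, one row of the output CSV without its identifier column). Whether the resulting similarities are meaningful for graphs of this size is the open question of this issue and is not settled by the code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings of two graphs.
print(cosine_similarity([0.1, 0.3, -0.2], [0.2, 0.6, -0.4]))
```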

Choosing the right embedding dimension

Hi,
I am currently trying to figure out which dimension size to choose for my graph database.
That's because I have relatively small graphs with, on average, 7 nodes (14 with predicates) and 21 edges.
By my calculation, I arrive at an extracted_features size of 21 for each input graph, using a wl_iterations value of 2.
I'm using the following formula for this:

num(extracted_feature) = num(nodes) + num(nodes) * wl_iterations

In doc2vec terms, this means that each document has 21 words. Am I right on this point?
Because that seems a bit low to me, I wonder which dimension size I should use for this setup. A dimension size of 128 seems too high to me.
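
The estimate above can be reproduced in a few lines, taking the poster's formula at face value (it is their own reasoning, not a documented property of the implementation). `extracted_feature_count` is a name made up for this sketch.

```python
def extracted_feature_count(num_nodes, wl_iterations):
    """One feature per node initially, plus one per node per WL iteration."""
    return num_nodes + num_nodes * wl_iterations


print(extracted_feature_count(7, 2))   # 21
print(extracted_feature_count(14, 2))  # 42
```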

Greetings and thanks for this great work,
Christian

subgraphs

Can I use graph2vec to represent subgraphs (containing two or three nodes) as vectors, and then cluster these subgraphs into N clusters using the K-means or DBSCAN algorithms? Is this a logical process?

Thanks very much .

Graph2vec infer

Hi, I am trying to use the infer method to test my graph2vec embedding without retraining the model. I follow these steps

graphs_train = list of networkx graphs
graphs_test = list of networkx graphs
g2v_emb = g2v.Graph2Vec()
g2v_emb.fit(graphs_train)
test_emb = g2v_emb.infer(graphs_test)

However, this returns a list that looks like this: ['9', '595be5e4bb7167f35a2878342bed1787', '00129eb2ff1730f9d0db0e5e3ecb7059', '5', '92dd168e872bef807dc1617ddd3f4bd3', ...]. Any idea of where I am going wrong here?

Also, thank you for this repository, this code has been very helpful!

How about training an undirected weighted graph?

How about training on an undirected weighted graph? How should I modify the code and input?
And how about dynamic graph input?

how to generate embeddings of graphml or graphson files as input using your library?

The dataset here uses .json files, but my data is in GraphML (XML) or GraphSON (JSON) format. How can I use your library to generate embeddings from GraphML or GraphSON files?

Or is there any other library you would suggest that can do the same?

ValueError while using the default Dataset

I was having trouble using the tool with my dataset, so I tried using the default dataset using the following command:

python src/graph2vec.py --dimensions 32

But even on the default dataset provided, I get the following error:

Traceback (most recent call last):
  File "src/graph2vec.py", line 129, in <module>
    main(args)
  File "src/graph2vec.py", line 125, in main
    save_embedding(args.output_path, model, graphs, args.dimensions)
  File "src/graph2vec.py", line 98, in save_embedding
    out.append([int(identifier)] + list(model.docvecs["g_"+identifier]))
ValueError: invalid literal for int() with base 10: 'dataset\\0'

I tried fixing the error for the string format, but that raises some KeyErrors. Could I kindly get some assistance with this?

Edit: I have attached the KeyError below:

Traceback (most recent call last):
  File "src/graph2vec.py", line 130, in <module>
    main(args)
  File "src/graph2vec.py", line 126, in main
    save_embedding(args.output_path, model, graphs, args.dimensions)
  File "src/graph2vec.py", line 99, in save_embedding
    out.append([int(identifier)] + list(model.docvecs["g_"+identifier]))
  File "C:\AppData\Local\Programs\Python\Python37\lib\site-packages\gensim\models\keyedvectors.py", line 1613, in __getitem__
    raise KeyError("tag '%s' not seen in training corpus/invalid" % index)
KeyError: "tag 'g_0' not seen in training corpus/invalid"

Thank you

json file reading

Dear benedekrozemberczki,
I’m a student in Computer Science, from the University of Salerno.
I’m dealing with your graph2vec implementation after reading the related article, which I found very interesting.
When trying to run your code, I found a small bug.
Below you can find the error and my fix.
parser.add_argument("--input-path", nargs="?", default="../dataset", help="Input folder with jsons.")

parser.add_argument("--output-path", nargs="?", default="../features/nci2.csv", help="Embeddings path.")
At line 111 of the graph2vec script, the list returned by glob was empty, so it did not contain the JSON files from the dataset folder.
graphs = glob.glob(args.input_path + "*.json")

I rewrote the code this way to list the files in the directory:

graphs = []
pathfilename = args.input_path
# Collect the path of every file in the input folder.
for files in os.listdir(pathfilename):
    graphs.append(pathfilename + '/' + files)
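
The empty result is consistent with the quoted defaults: "../dataset" + "*.json" yields the pattern "../dataset*.json", which looks for files next to the folder and not inside it. An alternative sketch that keeps the `.json` filter is shown below; `list_graph_files` is a hypothetical helper name.

```python
import glob
import os


def list_graph_files(input_path):
    """Return the sorted .json files directly inside input_path."""
    # os.path.join inserts the separator whether or not input_path ends with one.
    pattern = os.path.join(input_path, "*.json")
    return sorted(glob.glob(pattern))
```

Unlike `os.listdir`, this skips any non-JSON files that happen to be in the folder, and it works with or without a trailing slash on the input path.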

Hope this will help you :D
Given that Graph2vec will be a main part of my thesis, would you mind if I contact you in case of difficulties?

Kind regards,
Gerardo "Dino" Benevento

comparing with original software package and citation

Dear author,
I want to ask: since this software is not the original graph2vec implementation, does this version generate the same results as the original version released by the authors of the paper?
And if I use this work, what else can I cite, besides that paper, to show that I used your software or took ideas from it?

Thanks

can't improve my problem when add graph struct

I used the graph embeddings learned by graph2vec as input to a neural network and concatenated them with the sequence features learned by an LSTM, but the results were always worse than using the LSTM alone without the graph structure. Why is that?

Node Features

Hi everybody,
I have a question about node features. By default these features are set to the node degree. I replaced them with a list, for example:
{"edges": [[1, 2], [1, 3], [1, 4], [2, 3], [2, 4], [3, 4]], "features": {"1": 3, "2": 3, "3": 3, "4": 3}}
was replaced by
{"edges": [[1, 2], [1, 3], [1, 4], [2, 3], [2, 4], [3, 4]], "features": {"1": [3,1], "2": [3,1], "3": [3,1], "4": [3,1]}}

but I get this error:
RuntimeError: you must first build vocabulary before training the model

How can I use a list as node features in the input to the algorithm?
I would be very thankful for your guidance.

Small question

Hi Benedek,

First of all, massive amounts of kudos for all the implementations I discovered from you today. Strong work!

Concerning this graph2vec algorithm... At first glance, it seems identical to the rdf2vec algorithm (with the Weisfeiler-Lehman extension), except for the fact that this algorithm cannot deal with named edges (only named nodes). Am I correct, or is there a significant difference that I am currently not seeing?

Also, nice implementation of the WL relabeling!

Thanks and kind regards,
Gilles

Getting This Error When Running on a graph with 1304 nodes

Does anyone know how to solve this?

KeyError: "tag 'g_0' not seen in training corpus/invalid"

Hi, I'm trying to run the program as indicated in the readme file, from the graph2vec-master directory:
python src/graph2vec.py

However, I get the following error:
Feature extraction started.

100%|████████████████████████████████████████████████████████████████████████| 51/51 [00:05<00:00, 9.59it/s]

Optimization started.

Traceback (most recent call last):
  File "src/graph2vec.py", line 129, in <module>
    main(args)
  File "src/graph2vec.py", line 125, in main
    save_embedding(args.output_path, model, graphs, args.dimensions)
  File "src/graph2vec.py", line 98, in save_embedding
    out.append([int(identifier)] + list(model.docvecs["g_"+identifier]))
  File "c:\users\owner\appdata\local\programs\python\python37\lib\site-packages\gensim\models\keyedvectors.py", line 1600, in __getitem__
    raise KeyError("tag '%s' not seen in training corpus/invalid" % index)
KeyError: "tag 'g_0' not seen in training corpus/invalid"

I edited the default dataset path (from "/" to "\") to make it work with Windows. That's the only edit I've made. As you can see, I'm using Python 3.7.
Any ideas on what can be causing this error?

Question about the output embedding

Hi,

I found that if I put several identical graphs into graph2vec, the output embeddings are not the same. Specifically, I made several copies of 0.json in /dataset, ran the training, and found that the resulting embeddings were different.

I wonder whether this is the correct output.

multiple node feature?

Hi,
How can I use multiple node features for each node? The example only uses one.
Thanks.

Input data

Sorry for the question, but I don't know how to create a JSON file like yours.
I have a file.edgelist for my graph and I don't understand how to create a JSON file with edges and features from it.

Can you tell me how I can do it?

Thanks
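
A minimal sketch of one way to do this conversion, assuming a whitespace-separated edge list with integer node IDs and using the node degree as the single feature (other issues on this page describe degree as the default feature). The function name `edgelist_to_graph_dict` is hypothetical and the code is not taken from the repository.

```python
import json
from collections import defaultdict


def edgelist_to_graph_dict(lines):
    """Turn "u v" edge-list lines into {"edges": [...], "features": {...}}."""
    edges = []
    neighbors = defaultdict(set)
    for line in lines:
        parts = line.split()
        # Skip blank lines, malformed lines and "#" comments.
        if len(parts) < 2 or parts[0].startswith("#"):
            continue
        u, v = int(parts[0]), int(parts[1])
        edges.append([u, v])
        neighbors[u].add(v)
        neighbors[v].add(u)
    # Degree counted as the number of distinct neighbors (simple graph assumed).
    features = {str(node): len(adj) for node, adj in sorted(neighbors.items())}
    return {"edges": edges, "features": features}


example_lines = ["0 1", "0 2", "1 2", "2 3"]
print(json.dumps(edgelist_to_graph_dict(example_lines)))
```

To build a dataset, read each edge-list file with `open(...)`, pass the file object to the function, and write the result with `json.dump` to a file in the input folder. Other issues on this page suggest the script expects integer file names such as 0.json.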

What does the output file contain

Actually, I don't quite understand what the data in the output "nci.csv" represents.

RuntimeError: you must first build vocabulary before training the model

Hello! I was trying to use my own dataset, with the format that you give in the example. But when running the code, I get the following error: "RuntimeError: you must first build vocabulary before training the model". Here I show a sample of the dataset.

{"edges": [[0, 16], [0, 14], [1, 2], [1, 2], [2, 1], [2, 16], [3, 4], [3, 2], [4, 8], [4, 3], [5, 1], [5, 17], [6, 7], [6, 11], [7, 11], [7, 6], [8, 4], [8, 11], [9, 12], [9, 13], [10, 13], [10, 9], [11, 7], [11, 8], [12, 9], [12, 13], [13, 10], [13, 12], [14, 16], [14, 0], [15, 0], [15, 14], [16, 0], [16, 0], [17, 15], [17, 0]], "features": {"0": 4, "1": 5, "2": 28, "3": 28, "4": 28, "5": 37, "6": 40, "7": 48, "8": 28, "9": 40, "10": 28, "11": 41, "12": 32, "13": 40, "14": 11, "15": 32, "16": 28, "17": 36, "18": 20, "19": 50, "20": 23, "21": 56, "22": 11, "23": 36, "24": 20, "25": 48, "26": 23, "27": 48, "28": 4, "29": 5, "30": 11, "31": 12, "32": 4, "33": 4, "34": 14, "35": 20}}

AttributeError: 'DegreeView' object has no attribute 'items'

Hey,
I ran into an issue when trying to use only an edge list as input. I attached one of my files.
The error is the following:

python graph2vec/src/graph2vec.py --input-path graph_data/edge_lists/1/ --output-path nci2.csv

Feature extraction started.

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00,  9.73it/s]
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "*/venv/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 418, in _process_worker
    r = call_item()
  File "*/venv/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 272, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "*/venv/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 608, in __call__
    return self.func(*args, **kwargs)
  File "*/venv/lib/python3.8/site-packages/joblib/parallel.py", line 255, in __call__
    return [func(*args, **kwargs)
  File "*/venv/lib/python3.8/site-packages/joblib/parallel.py", line 255, in <listcomp>
    return [func(*args, **kwargs)
  File "graph2vec/src/graph2vec.py", line 82, in feature_extractor
    graph, features, name = dataset_reader(path)
  File "graph2vec/src/graph2vec.py", line 72, in dataset_reader
    features = {int(k): v for k, v in features.items()}
AttributeError: 'DegreeView' object has no attribute 'items'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "graph2vec/src/graph2vec.py", line 129, in <module>
    main(args)
  File "graph2vec/src/graph2vec.py", line 112, in main
    document_collections = Parallel(n_jobs=args.workers)(delayed(feature_extractor)(g, args.wl_iterations) for g in tqdm(graphs))
  File "*/venv/lib/python3.8/site-packages/joblib/parallel.py", line 1017, in __call__
    self.retrieve()
  File "*/venv/lib/python3.8/site-packages/joblib/parallel.py", line 909, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "*/venv/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 562, in wrap_future_result
    return future.result(timeout=timeout)
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
AttributeError: 'DegreeView' object has no attribute 'items'

I managed to solve it by editing the function dataset_reader:

import json
import os

import networkx as nx

def dataset_reader(path):
    """
    Function to read the graph and features from a json file.
    :param path: The path to the graph json.
    :return graph: The graph object.
    :return features: Features hash table.
    :return name: Name of the graph.
    """
    # str.strip(".json") removes characters, not a suffix, so take the file stem instead.
    name = os.path.splitext(os.path.basename(path))[0]
    with open(path) as json_file:
        data = json.load(json_file)
    graph = nx.from_edgelist(data["edges"])

    if "features" in data:
        # JSON object keys are strings, so convert them back to integer node ids.
        # This conversion was moved into this branch because the object
        # returned by nx.degree() has no .items() method.
        features = {int(k): v for k, v in data["features"].items()}
    else:
        # nx.degree() yields (node, degree) pairs, so build the dictionary from them.
        features = {int(node): degree for node, degree in nx.degree(graph)}
    return graph, features, name

I hope it does what the function was supposed to do.
I attached one of my files as an example below.
805431435.zip

Different sized edge/feature lists

Hi, I am trying to implement graph2vec on a number of graphs with varying node/feature list lengths. Does this only operate on graphs with the same number of nodes? I tried removing some of the elements in the edge list for one of the sample .json files and this was the result:

Traceback (most recent call last):
  File "\anaconda3\envs\g2vec\lib\site-packages\joblib\externals\loky\process_executor.py", line 418, in _process_worker
    r = call_item()
  File "\anaconda3\envs\g2vec\lib\site-packages\joblib\externals\loky\process_executor.py", line 272, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "\anaconda3\envs\g2vec\lib\site-packages\joblib\_parallel_backends.py", line 567, in __call__
    return self.func(*args, **kwargs)
  File "\anaconda3\envs\g2vec\lib\site-packages\joblib\parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "\anaconda3\envs\g2vec\lib\site-packages\joblib\parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "src\graph2vec.py", line 83, in feature_extractor
    machine = WeisfeilerLehmanMachine(graph, features, rounds)
  File "src\graph2vec.py", line 29, in __init__
    self.do_recursions()
  File "src\graph2vec.py", line 53, in do_recursions
    self.features = self.do_a_recursion()
  File "src\graph2vec.py", line 39, in do_a_recursion
    degs = [self.features[neb] for neb in nebs]
  File "src\graph2vec.py", line 39, in <listcomp>
    degs = [self.features[neb] for neb in nebs]
KeyError: 20
"""
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "src\graph2vec.py", line 130, in <module>
    main(args)
  File "src\graph2vec.py", line 113, in main
    document_collections = Parallel(n_jobs=args.workers)(delayed(feature_extractor)(g, args.wl_iterations) for g in tqdm(graphs))
  File "\anaconda3\envs\g2vec\lib\site-packages\joblib\parallel.py", line 930, in __call__
    self.retrieve()
  File "\anaconda3\envs\g2vec\lib\site-packages\joblib\parallel.py", line 833, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "\anaconda3\envs\g2vec\lib\site-packages\joblib\_parallel_backends.py", line 521, in wrap_future_result
    return future.result(timeout=timeout)
  File "\anaconda3\envs\g2vec\lib\concurrent\futures\_base.py", line 405, in result
    return self.__get_result()
  File "\anaconda3\envs\g2vec\lib\concurrent\futures\_base.py", line 357, in __get_result
    raise self._exception
KeyError: 20
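
The failing lookup in the traceback is self.features[neb], which suggests that some node appearing in "edges" (here node 20) has no entry in "features". Graphs of different sizes should not be a problem in themselves, as long as each file is internally consistent. Below is a minimal sketch of a consistency check, assuming the JSON layout of the sample files ("edges" as a list of node pairs, "features" keyed by node id). The helper name missing_feature_nodes is hypothetical and not part of the repository.

```python
import json

def missing_feature_nodes(record):
    """Return the sorted node ids that occur in an edge but have no feature."""
    # JSON object keys are strings, so normalise them to integers first.
    feature_nodes = {int(node) for node in record.get("features", {})}
    edge_nodes = {int(node) for edge in record["edges"] for node in edge}
    return sorted(edge_nodes - feature_nodes)

# Toy record: node 20 is used in an edge but was dropped from "features".
record = json.loads('{"edges": [[0, 1], [1, 20]], "features": {"0": 1, "1": 2}}')
print(missing_feature_nodes(record))  # [20]
```

Running such a check on every input file before feature extraction would point to the file and node that trigger the KeyError.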

how to get the graph dataset?

Hi,
I have a dataset of molecular SMILES strings, but I don't know how to transform it into the graph dataset format used in this repo. Any help would be appreciated.
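
One possible route, sketched here under assumptions, is to parse each SMILES string with RDKit (a separate library, not a dependency of this repository) and write one JSON file per molecule in the layout of the sample files: an "edges" list of node-index pairs plus a "features" map from node id to a label. Using the atomic number as the node label is an illustrative choice, and both function names are hypothetical.

```python
import json

def molecule_record(atomic_numbers, bonds):
    """Build a graph2vec-style record from atom labels and bonded index pairs."""
    return {
        "edges": [[int(a), int(b)] for a, b in bonds],
        "features": {str(i): int(z) for i, z in enumerate(atomic_numbers)},
    }

def smiles_to_record(smiles):
    """Parse a SMILES string with RDKit (imported lazily; it must be installed)."""
    from rdkit import Chem  # assumption: RDKit is available in the environment

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    atomic_numbers = [atom.GetAtomicNum() for atom in mol.GetAtoms()]
    bonds = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]
    return molecule_record(atomic_numbers, bonds)

# Ethanol ("CCO") written out by hand: atoms C, C, O with bonds 0-1 and 1-2.
record = molecule_record([6, 6, 8], [(0, 1), (1, 2)])
print(json.dumps(record))
```

Each record would then be saved as its own .json file in the input folder.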

graph encoding

What are the features? Can a node have more than one feature (string value)? Can edges have features? I am trying to encode semantic structures such as

http://delph-in.github.io/delphin-viz/demo/#input=I%20bought%20a%20book%20yesterday.&count=1&grammar=erg2018-uw&dmrs=true

Does it make sense? At a minimum, nodes have labels and edges have labels. Nodes can also have extra information attached to them (move the mouse over the nodes to see some examples). Can I encode that in graph2vec?

Explanation of params and how to extend to multiple node features

I am having a little trouble understanding the following.

I don't quite get the intuition behind the min_count parameter. How do I properly tune these two parameters? Any insight on this would be very helpful.
I'm also having a similar problem with the wl_iterations parameter. Any insight on this parameter would also be really helpful.
Lastly, in a different issue you mentioned that it's very easy to generalize the node features so that, instead of accepting only one node feature, it could take in multiple. But I'm having trouble implementing this. Pointers on this would also be really helpful!
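
As a rough intuition for these questions: each Weisfeiler-Lehman iteration relabels every node from its own label and its neighbours' labels, so wl_iterations controls how large a neighbourhood a single "word" describes. Assuming min_count is passed through to gensim's Doc2Vec, it makes the model ignore words whose total frequency in the corpus is below that threshold. The sketch below is a simplified, standalone illustration and not the repository's code. It also shows one possible way to handle several node attributes, namely joining them into a single label string before relabeling. All function names are hypothetical.

```python
import hashlib

def combine_attributes(attributes):
    """Join several node attributes into one label string (an illustrative choice)."""
    return "|".join(str(value) for value in attributes)

def wl_iteration(adjacency, labels):
    """One relabeling round: hash each node's label with its sorted neighbour labels."""
    new_labels = {}
    for node, neighbours in adjacency.items():
        neighbour_labels = sorted(str(labels[n]) for n in neighbours)
        signature = "_".join([str(labels[node])] + neighbour_labels)
        new_labels[node] = hashlib.md5(signature.encode()).hexdigest()
    return new_labels

def wl_document(adjacency, labels, rounds):
    """Collect the initial labels plus the labels produced by each round."""
    words = [str(labels[node]) for node in adjacency]
    for _ in range(rounds):
        labels = wl_iteration(adjacency, labels)
        words.extend(labels[node] for node in adjacency)
    return words

# Path graph 0 - 1 - 2 where every node carries two attributes.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
attributes = {0: ("C", 1), 1: ("C", 2), 2: ("O", 1)}
labels = {node: combine_attributes(values) for node, values in attributes.items()}

for rounds in (0, 1, 2):
    print(rounds, len(wl_document(adjacency, labels, rounds)))
```

More rounds produce more, and more specific, words per graph. Very specific words tend to be rare, which is where a frequency threshold such as min_count starts to matter.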

error while running in windows

hello!
When I run "python src/graph2vec", there are some errors.

And then I set breakpoints like this

And I found that
f="./dataset\\0.json" and identifier="dataset\\0". That's why int(identifier) fails.
I want to know what identifier should be when the program runs correctly (or on Linux).
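
Judging from the values reported above, the identifier is presumably meant to be the bare file name without its directory or extension (0 for 0.json), which is what would make the int(identifier) call succeed. Splitting on "/" only works with POSIX-style separators. Below is a minimal sketch of a separator-independent way to derive it with pathlib from the standard library. The function name graph_identifier is hypothetical.

```python
from pathlib import PureWindowsPath

def graph_identifier(path):
    """Return the file name without directory or extension."""
    # PureWindowsPath treats both "/" and "\\" as separators and can be
    # instantiated on any operating system, so both path styles resolve alike.
    return PureWindowsPath(path).stem

print(graph_identifier("./dataset\\0.json"))  # 0
print(graph_identifier("./dataset/0.json"))   # 0
```

In code that only handles real files on the current machine, pathlib.Path(path).stem would be the more conventional choice.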

Variance in resulting embeddings

Hello!
I tried to calculate your embeddings for my own dataset of 300 graphs. The features are node degrees, as suggested in the paper; I only modified the number of epochs, setting it to 40.
Let's say I now have embeddings 1, 2 and 3 of the same set of graphs (so graph2vec applied 3 times to the same data).
Then I tried to calculate map@5 for retrieval of the most similar embeddings between embeddings 1 and 3, 1 and 2, and 3 and 2. I got really bad results, namely map@5: 0.249222, 0.260222, 0.163278.
I got similar negative results using the dataset provided and default parameters.
Please, can you comment on that or let me know what I might be doing wrong?
Thank you!
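
One thing that may explain part of this: embeddings trained in separate runs start from different random initialisations, so the coordinate systems of two runs are generally not aligned, and comparing raw vectors across runs is not meaningful even when each run is internally consistent. A possible alternative, sketched below under that assumption, is to compute nearest neighbours within each run and then compare the neighbour sets across runs. The function names and the toy vectors are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_k_neighbours(embeddings, k):
    """For every graph id, return the set of its k most similar graphs in the same run."""
    neighbours = {}
    for name, vector in embeddings.items():
        others = [other for other in embeddings if other != name]
        others.sort(key=lambda other: cosine(vector, embeddings[other]), reverse=True)
        neighbours[name] = set(others[:k])
    return neighbours

def mean_neighbour_overlap(run_a, run_b, k):
    """Average Jaccard overlap of the k-nearest-neighbour sets of two runs."""
    neighbours_a = top_k_neighbours(run_a, k)
    neighbours_b = top_k_neighbours(run_b, k)
    scores = [
        len(neighbours_a[name] & neighbours_b[name])
        / len(neighbours_a[name] | neighbours_b[name])
        for name in run_a
    ]
    return sum(scores) / len(scores)

# Toy data: run_b is run_a rotated by 90 degrees, so the raw vectors differ
# while the neighbourhood structure is identical.
run_a = {0: (1.0, 0.0), 1: (0.9, 0.1), 2: (0.0, 1.0), 3: (0.1, 0.9)}
run_b = {name: (-y, x) for name, (x, y) in run_a.items()}
print(mean_neighbour_overlap(run_a, run_b, k=1))  # 1.0
```

A low overlap under this kind of comparison would indicate that the runs genuinely disagree about which graphs are similar, rather than merely living in different coordinate systems.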

Should I use pretrained weights or let the neural network learn them?

I have trained embeddings for my graphs with graph2vec, and now I want to use them in a TensorFlow model. Should I set trainable = False and use the fixed pretrained vectors, or should I set trainable = True and let them train?

Thanks

extracted_features might differ from the original graph2vec

The original graph2vec uses both the previous and the next extracted features in WL.

But I think this implementation uses only the previous extracted features.

graph2vec

We have used graph2vec to represent our sub-graphs as fixed length feature vectors for our project.

Using the default parameters, we entered 5 identical subgraphs (see below) but the features produced for the identical graphs (see attached csv file) are vastly different.

Also, K-means clustering on the features above did not produce tight clusters for the identical subgraphs. Can you provide us with some insight as to what we may be doing wrong? We appreciate any suggestions you can provide.
Thank you so much
Input

0.json