alibaba / graph-learn
An Industrial Graph Neural Network Framework
License: Apache License 2.0
I saw the sentence: "Otherwise, please refer to the section 'build from source'." But I have never managed to install successfully. My environment is Python 3 on CentOS (Alibaba Finance Cloud). Running the commands below keeps failing, and even after downloading the master package from the web page and copying it over, the git and make steps still fail.
git clone https://github.com/alibaba/graph-learn.git
cd graph-learn
git submodule update --init
make test
make python
What are the roles of the client and the server?
About deploy_mode: what does mode 2 mean relative to modes 0 and 1?
Thank you for the answer!
CentOS 7
python 2.7.5
tensorflow 1.12.0
I have built GL from source successfully and passed ./test_cpp_ut.sh. But when I run ./test_python_ut.sh, I get stuck on import errors.
ImportError: cannot import name abs

./graphlearn/python/tests/test_node_weighted.py
Traceback (most recent call last):
  File "./graphlearn/python/tests/test_node_weighted.py", line 23, in <module>
    import graphlearn as gl
  File "/usr/lib64/python2.7/site-packages/graphlearn/__init__.py", line 33, in <module>
    from graphlearn.python.model.tf import aggregators
  File "/usr/lib64/python2.7/site-packages/graphlearn/python/model/tf/aggregators/__init__.py", line 20, in <module>
    from graphlearn.python.model.tf.aggregators.gcn_aggregator import GCNAggregator
  File "/usr/lib64/python2.7/site-packages/graphlearn/python/model/tf/aggregators/gcn_aggregator.py", line 20, in <module>
    import tensorflow as tf
  File "/usr/lib/python2.7/site-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/usr/lib/python2.7/site-packages/tensorflow/python/__init__.py", line 88, in <module>
    from tensorflow.python import keras
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/__init__.py", line 24, in <module>
    from tensorflow.python.keras import activations
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/activations/__init__.py", line 22, in <module>
    from tensorflow.python.keras._impl.keras.activations import elu
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/_impl/keras/__init__.py", line 21, in <module>
    from tensorflow.python.keras._impl.keras import activations
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/_impl/keras/activations.py", line 23, in <module>
    from tensorflow.python.keras._impl.keras import backend as K
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/_impl/keras/backend.py", line 38, in <module>
    from tensorflow.python.layers import base as tf_base_layers
  File "/usr/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 25, in <module>
    from tensorflow.python.keras.engine import base_layer
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/engine/__init__.py", line 23, in <module>
    from tensorflow.python.keras.engine.base_layer import InputSpec
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 35, in <module>
    from tensorflow.python.keras import backend
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/backend/__init__.py", line 22, in <module>
    from tensorflow.python.keras._impl.keras.backend import abs
I believe the error comes from an inconsistency between TF and its dependencies, judging by the related threads. I have tried many combinations of package versions but still failed.
Could you provide the detailed versions of TF's dependencies?
Sigmoid cross-entropy loss is used for the unsupervised model, and KL loss is used for LINE.
But according to the formula cross_entropy - entropy = KL, the entropy is fixed over the entire training set, so using cross-entropy is the same as using KL.
(Of course, for each specific batch, the entropy term changes.)
I don't quite understand why LINE uses KL loss. Can you help?
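For reference, the identity being invoked, written out in my notation (with p the data distribution and q_theta the model):

D_{\mathrm{KL}}(p \,\|\, q_\theta) = H(p, q_\theta) - H(p)

Since H(p) does not depend on theta, minimizing the cross entropy H(p, q_theta) and minimizing the KL divergence give the same gradients over the full training set.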
@archwalker @baoleai
Thanks
I am trying to install graph-learn from source. When I run make python, I get the following error. My Python version is 3.7.4.
python /home/kanon/code/graph-learn/setup/setup.py bdist_wheel
/home/kanon/anaconda3/lib/python3.7/site-packages/setuptools/dist.py:462: UserWarning: The version specified (b'0.1') is an invalid version, this may not work as expected with newer versions of setuptools, pip, and PyPI. Please see PEP 440 for more details.
"details." % self.metadata.version
running bdist_wheel
Traceback (most recent call last):
  File "/home/kanon/code/graph-learn/setup/setup.py", line 85, in <module>
    package_data={'': ['python/lib/lib*.so*']},
  File "/home/kanon/anaconda3/lib/python3.7/site-packages/setuptools/__init__.py", line 144, in setup
    return distutils.core.setup(**attrs)
  File "/home/kanon/anaconda3/lib/python3.7/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/home/kanon/anaconda3/lib/python3.7/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/home/kanon/anaconda3/lib/python3.7/distutils/dist.py", line 984, in run_command
    cmd_obj.ensure_finalized()
  File "/home/kanon/anaconda3/lib/python3.7/distutils/cmd.py", line 107, in ensure_finalized
    self.finalize_options()
  File "/home/kanon/anaconda3/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 129, in finalize_options
    self.data_dir = self.wheel_dist_name + '.data'
  File "/home/kanon/anaconda3/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 164, in wheel_dist_name
    safer_version(self.distribution.get_version()))
  File "/home/kanon/anaconda3/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 43, in safer_version
    return safe_version(version).replace('-', '_')
  File "/home/kanon/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1333, in safe_version
    return str(packaging.version.Version(version))
  File "/home/kanon/anaconda3/lib/python3.7/site-packages/pkg_resources/_vendor/packaging/version.py", line 200, in __init__
    match = self._regex.search(version)
TypeError: cannot use a string pattern on a bytes-like object
Makefile:317: recipe for target 'python' failed
make: *** [python] Error 1
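For what it's worth, the TypeError above matches what happens when the version reaches setuptools as bytes, which is what the earlier b'0.1' warning suggests. A minimal sketch of the failure and the obvious decode fix, assuming that is indeed the cause:

from pkg_resources import safe_version

# Reproduction sketch of the TypeError above: packaging's version regex is a
# str pattern, so bytes input fails exactly as in the traceback.
raw_version = b"0.1"  # what the warning says setup.py passed
version = raw_version.decode() if isinstance(raw_version, bytes) else raw_version
print(safe_version(version))  # prints '0.1'; passing the raw bytes raises TypeError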
OS: CentOS 7
python: 3.6
I have encountered several errors when trying to build GL from source. It seems to be a Python-version issue.
Hey guys, thank you for open-sourcing graph-learn. However, I ran the demo in bipartite-graphsage, and the result is awful (rather low recall and precision). Can you give me some hints?
I just followed the instructions on this page:
bipartite_graphsage
Where is the 'train_unsupervised.py' code file?
I set the config below but get the error AssertionError: ego_spec num must be the same with hops num!
'hops_num': 3,
'u_neighs_num': [5,3,2],
'i_neighs_num': [5,3,2],
Great work!
I wonder how graph-learn does synchronous training.
It would be great if there were a distributed synchronous-training example.
When I run the graphsage dist_train.py (Cora data), the worker memory usage keeps increasing.
When I train the model with our own data, which is a larger graph, the memory usage grows even faster.
I suspect there may be a memory leak: maybe some objects from previous iterations are not freed? Any advice or suggestions would be greatly appreciated.
docker image: registry.cn-zhangjiakou.aliyuncs.com/pai-image/graph-learn:v0.1-cpu
code path: /workspace/graph-learn/examples/tf/graphsage (in docker container)
config: 2ps, 2worker / batchsize: 32 / epoch: 40000000
Does graph-learn prepare all batch sampling results before training?
For example, if I set up 1000 training iterations, is graph-learn's strategy to prepare the data for all 1000 batches first and then use the next method to fetch one batch at a time? (A generic sketch of the two patterns follows below.)
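For context, here is the access pattern I have in mind, in generic Python (not graph-learn's actual API): lazy batches pulled one at a time via next, versus all batches materialized up front.

# Generic illustration of the two strategies in question; not graph-learn code.
def lazy_batches(ids, batch_size):
    for i in range(0, len(ids), batch_size):
        yield ids[i:i + batch_size]  # each batch is produced only on demand

ids = list(range(10))
it = lazy_batches(ids, 4)
print(next(it))  # [0, 1, 2, 3] -- nothing beyond this batch has been prepared

eager = list(lazy_batches(ids, 4))  # the "prepare everything first" strategy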
In distributed mode, I first need to randomly select batch_size IDs as source vertices, and then sample further from these source vertices.
Are these source vertices selected from the subgraph on each machine or from the whole graph?
If they are selected only from the local subgraph, communication is reduced, but the model quality may also be reduced.
Thanks!
Could you give us a few examples of classical graph algorithms, such as PageRank, LPA, WCC, etc.?
Hi,
Thanks for developing this open-source project. I noticed that Alibaba also open-sourced Euler; can anyone point out the differences between the two?
Hi, I am exploring your DeepWalk implementation, but I am a little confused by the gen_pair function used in _positive_sample.
Specifically, why does gl.gen_pair slide the window across paths rather than within each path?
>>> path = [np.array([1, 2]), np.array([3, 4]), np.array([5, 6])]
>>> left_window_size = right_window_size = 1
>>> src_ids, dst_ids = gen_pair(path, left_window_size, right_window_size)
>>> print(src_ids, dst_ids)
(array([1, 2, 3, 4, 3, 4, 5, 6]), array([3, 4, 1, 2, 5, 6, 3, 4]))
The example above starts from 3 nodes {1, 3, 5}, each with a random walk of length 2, which finally collects 3 paths: p1 = [1, 2], p2 = [3, 4], and p3 = [5, 6].
I thought the original DeepWalk paper applies SkipGram to each of these paths, so we should apply SkipGram to p1, p2, and p3 separately. But the implementation seems to apply SkipGram to the whole set of paths; for example, it pushes all nodes in the path into pair[0] or pair[1].
Why can we do that instead of following the original algorithm in the paper? Or am I misunderstanding something?
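For contrast, here is what I understand per-path SkipGram pairing (the paper's formulation, as I read it) would produce on the same input; this is my own sketch, not graph-learn code:

import numpy as np

# My expectation sketched out: slide the window inside each path only,
# never across paths. Window size is 1 on both sides.
paths = [np.array([1, 2]), np.array([3, 4]), np.array([5, 6])]
src_ids, dst_ids = [], []
for path in paths:
    for i, src in enumerate(path):
        for j in range(max(0, i - 1), min(len(path), i + 2)):
            if j != i:
                src_ids.append(src)
                dst_ids.append(path[j])
print(src_ids, dst_ids)  # [1, 2, 3, 4, 5, 6] [2, 1, 4, 3, 6, 5]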
Thanks.
Do you have any references on Bipartite GraphSage? What are the differences between Bipartite GraphSage and classical GraphSage?
Hi,
I ran dist_train.py (examples/tf/graphsage/dist_train.py) and it works well. However, when I try to save embeddings after training, it raises RuntimeError("Graph is finalized and cannot be modified."). I hit the same issue when I run Bipartite GraphSAGE in distributed mode.
Traceback (most recent call last):
  File "dist_train.py", line 132, in <module>
    main()
  File "dist_train.py", line 128, in main
    train(config, g)
  File "dist_train.py", line 81, in train
    u_embs = trainer.get_node_embedding("u")
  File "/usr/local/lib/python2.7/dist-packages/graphlearn/python/model/tf/trainer.py", line 57, in get_node_embedding
    ids, emb, iterator = self.model.node_embedding(node_type)
In the Distribute runtime design doc, it shows the client sending an op request to only 1 server, and that server sending out partitioned requests.
However, the Distributed mode doc shows the client connecting to multiple servers.
I'm a bit confused about the client-server mode: does the client send op requests to multiple servers?
I checked grpc_client.cc, and it seems the client connects to only 1 server. If so, this picture is not correct.
I can install from source successfully by removing -mavx from CXXFLAGS. Could you add an option to the Makefile to auto-detect AVX support?
When setting hops_num: 1 and neighs_num: [25] in examples/tf/graphsage/train_supervised.py, it raises an AssertionError:
assert self._depth + 1 == len(feature_encoders)
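A plausible reading of that assertion, inferred from the assert text alone rather than from the docs: the model builds one feature encoder for the ego nodes plus one per hop, so feature_encoders needs hops_num + 1 entries.

# Toy check of the invariant the assert appears to encode (my inference, with
# hypothetical encoder names): hops_num = 1 requires 2 feature encoders.
hops_num = 1
feature_encoders = ["ego_encoder", "hop1_encoder"]
assert hops_num + 1 == len(feature_encoders)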
In the dynamic_worker_threadpool.cc file, I think WaitForNotify() is used in several scenarios:
1. Between the push and pop of idle_threads_stack: AddTask() first pops the stack, which makes pinfo != info, and Set() is called before Wait(), so the thread doesn't need to wait.
2. When pinfo == info, the thread automatically loops for tasks.
3. When pinfo != info, Set() wakes another thread to continue working (via the condition signal).
So I am confused by 2 questions:
1. Before Shutdown(), it seems at least one thread must be active at any time, because each thread is always woken via Set(), which sets is_set to true. So when can WaitForIdle() return?
2. Shutdown() sets stopped_ so that active threads break out of the loop and complete. Why does one last thread need to be kept waiting for the event_for_all_workers_exit_ signal? I think that when thread_num_ decreases to zero, the event_for_all_workers_exit_ Wait() could return directly.
I got an error when trying to install from the wheel file on CentOS 7. Do you have any plans to support CentOS?
ERROR: graphlearn-0.1-cp27-cp27mu-linux_x86_64.whl is not a supported wheel on this platform.
As titled.
I see that graph-learn has both C++ and Python versions of the aggregators (mean, sum, ...).
Why does graph-learn need the C++ version of the aggregators?
Hi,
when I test performance using the graph-learn framework and set inter_thread_num to 64 or greater via gl.set_inter_threadnum(64), all the servers hang while initializing graph data and the workers keep waiting for the servers to become ready.
Dear developers,
When I run the graphsage PPI training example, the script gets killed while saving embeddings:
Epoch 00, Iteration 110, Time(s) 2.8076, Loss 1.26855
Epoch 00, Iteration 111, Time(s) 0.6244, Loss 1.25834
save embedding...
(a few minutes later)
Killed
I have confirmed that ./id_emb exists at the right path and that there are 18 GB of space available on the disk.
Could you kindly let me know the possible reason and how to fix this issue?
Thanks very much!
Kevin
Hello AliGraph people!
I read the VLDB 2019 paper and I found it very interesting. It is great that the project is open source.
I am wondering if I can implement my own embedding algorithm on top of your system. For example, can I implement DeepWalk in AliGraph?
If yes, do you provide abstractions for defining random-walk operations in AliGraph?
If not, is it possible to build such abstractions in your system?
Thanks in advance.
Best,
Makis
If I change neighs_num to [25, 10] and batch_size to 100 in this file:
https://github.com/alibaba/graph-learn/blob/master/examples/tf/graphsage/dist_train.py
I see a ResourceExhaustedError, and it seems related to the graph-engine operations. I am not seeing any CPU or RAM bottleneck.
I run 2 parameter servers and 2 workers as shown in the wiki.
The error is:
23:56:40.288581 61070 notification.cc:194] RpcNotification:Failed req_type:LookupNodes status:Resource exhausted:Received message larger than max (8066392 vs. 4194304)
23:56:40.322116 61070 distribute_runner.h:125] Rpc failed:Resource exhausted:Received message larger than max (8066392 vs. 4194304)name:LookupNodes
Any idea how to fix this?
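For context, the 4194304 in the log is gRPC's default 4 MB maximum receive size. Whether graph-learn exposes a knob for this I don't know; in plain Python gRPC the limit would be raised with channel options like these (a sketch, not graph-learn API):

import grpc

# Sketch only: raising gRPC's default 4 MB (4194304-byte) message limits on a
# plain Python channel; the target address is a placeholder. How/whether
# graph-learn forwards such options to its internal channels is an assumption
# to verify with the maintainers.
options = [
    ("grpc.max_send_message_length", 64 * 1024 * 1024),
    ("grpc.max_receive_message_length", 64 * 1024 * 1024),
]
channel = grpc.insecure_channel("localhost:50051", options=options)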
Hello. I'm interested in how to use the METIS partitioning algorithm with GL. Is there any tutorial or doc? Thank you!
First of all, thank you guys for open-sourcing such an amazing project.
I tried to follow THIS manual to try out distributed training on a single machine, but failed to start the training process.
Here is my script to start the ps and worker processes.
PS_HOSTS="127.0.0.1:2300,127.0.0.1:2311"
WK_HOSTS="127.0.0.1:2200,127.0.0.1:2222"
python dist_train.py \
--tracker=./distributed \
--ps_hosts=${PS_HOSTS} \
--worker_hosts=${WK_HOSTS} \
--job_name=ps \
--task_index=0 &
python dist_train.py \
--tracker=./distributed \
--ps_hosts=${PS_HOSTS} \
--worker_hosts=${WK_HOSTS} \
--job_name=worker \
--task_index=0 &
python dist_train.py \
--tracker=./distributed \
--ps_hosts=${PS_HOSTS} \
--worker_hosts=${WK_HOSTS} \
--job_name=ps \
--task_index=1 &
python dist_train.py \
--tracker=./distributed \
--ps_hosts=${PS_HOSTS} \
--worker_hosts=${WK_HOSTS} \
--job_name=worker \
--task_index=1 &
wait
I also added some logging to the Graph.init() function (https://github.com/alibaba/graph-learn/blob/master/graphlearn/python/graph.py), but I never see "############# Server init done #############" printed.
if job_name == "client":
  pywrap.set_client_id(task_index)
  self._client = pywrap.rpc_client()
  self._server = None
else:
  print("############# Server init start #############")
  if job_name == "server":
    self._client = None
  if not tracker and kwargs.get("tracker"):
    tracker = kwargs["tracker"]
  if tracker:
    self._server = Server(task_index, server_count, tracker)
  else:
    self._server = Server(task_index, server_count)
  self._server.start()
  print("############# Server start done #############")
  self._server.init(self._edge_sources, self._node_sources)
  print("############# Server init done #############")
return self
Everything I could capture is listed below. It just keeps printing Invalid endpoint file: 0 until the end of the world.
main
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0402 13:10:49.755939 10816 naming_engine.cc:56] Connect naming engine ok: ./distributed/endpoints/
I0402 13:10:49.756223 10816 channel_manager.cc:94] Auto select server: 1
W0402 13:10:49.756240 10816 channel_manager.cc:100] Waiting for all servers started: 0/2
W0402 13:10:49.756494 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:49.756530 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:49.756541 10904 naming_engine.cc:159] Refresh endpoints count: 0
2020-04-02 13:10:49.771019: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-02 13:10:49.777325: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:2300, 1 -> 127.0.0.1:2311}
2020-04-02 13:10:49.777366: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:2200, 1 -> localhost:2222}
2020-04-02 13:10:49.784454: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:2222
main
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0402 13:10:49.878661 10814 naming_engine.cc:56] Connect naming engine ok: ./distributed/endpoints/
I0402 13:10:49.878902 10814 channel_manager.cc:94] Auto select server: 0
W0402 13:10:49.878921 10814 channel_manager.cc:100] Waiting for all servers started: 0/2
W0402 13:10:49.880380 10951 naming_engine.cc:154] Invalid endpoint file: 0
main
W0402 13:10:49.880429 10951 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:49.880441 10951 naming_engine.cc:159] Refresh endpoints count: 0
############# Server init start #############
2020-04-02 13:10:49.894944: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-02 13:10:49.900562: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:2300, 1 -> 127.0.0.1:2311}
2020-04-02 13:10:49.900591: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2200, 1 -> 127.0.0.1:2222}
2020-04-02 13:10:49.901519: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:2200
main
############# Server init start #############
W0402 13:10:50.756636 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:50.756687 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:50.756696 10904 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:50.880582 10951 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:50.880635 10951 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:50.880697 10951 naming_engine.cc:159] Refresh endpoints count: 0
[2020-04-02 13:10:50.888773] Server started.
############# Server start done #############
[2020-04-02 13:10:50.985136] Server started.
############# Server start done #############
W0402 13:10:51.756803 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:51.756860 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:51.756868 10904 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:51.880851 10951 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:51.880900 10951 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:51.880908 10951 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:52.756978 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:52.757043 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:52.757053 10904 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:52.881058 10951 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:52.881108 10951 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:52.881115 10951 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:53.757174 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:53.757233 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:53.757244 10904 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:53.881242 10951 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:53.881289 10951 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:53.881297 10951 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:54.757366 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:54.757421 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:54.757429 10904 naming_engine.cc:159] Refresh endpoints count: 0
Any clue? Thank you!
Say a node has multiple aspects of textual description; one way is to store them as multiple attributes of the node, separated by a delimiter such as a colon. E.g.
id:int64 attribute:string
10001 the color is blue:round shape:it's very nice and expensive
However, if the text itself contains a colon, the split breaks. What's the best way to feed multiple text attributes into graph-learn? Separating them by "\t" within a line breaks the code. What about putting the text attributes into multiple node files (one attribute per file)? Would that be supported?
I understand the text needs to be further encoded by custom encoders, which I plan to implement.
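One workaround I would try, assuming gl.Decoder's attr_delimiter parameter behaves as its name suggests (an assumption to verify against the data-source docs): keep all texts in one attribute column, but switch the delimiter to a character that cannot occur in the text.

import graphlearn as gl

# Sketch under the assumption that gl.Decoder accepts attr_delimiter: three
# string attributes separated by a rare control character, so colons inside
# the text no longer break the split.
decoder = gl.Decoder(attr_types=["string", "string", "string"],
                     attr_delimiter="\x01")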
Would you like to wrap any pointer data members with the class template “std::unique_ptr”?
Building the Python wheel from the master branch and running the GCN model gives:
python train_supervised.py
Traceback (most recent call last):
  File "train_supervised.py", line 22, in <module>
    import graphlearn as gl
  File "/home/cn/research/graph/venv/lib/python2.7/site-packages/graphlearn/__init__.py", line 16, in <module>
    from graphlearn import pywrap_graphlearn as pywrap
ImportError: /home/cn/research/graph/graph-learn/built/lib/libgraphlearn_shared.so: undefined symbol: _ZN6google14FlagRegistererC1IiEEPKcS3_S3_PT_S5_
It seems that the symbol _ZN6google14FlagRegistererC1IiEEPKcS3_S3_PT_S5_ (which is the mangled FlagRegisterer from the gflags package) was not found.
So I tried to modify the Makefile and link the gflags library manually, and the problem was solved (modified lines: here):
so: protobuf grpc glog gtest proto common platform service core
	@mkdir -p $(INCLUDE_DIR)
	@mkdir -p $(LIB_DIR)
	@mkdir -p $(BIN_DIR)
	$(CXX) $(CXXFLAGS) -shared $(PROTO_OBJ) $(COMMON_OBJ) $(PLATFORM_OBJ) $(SERVICE_OBJ) $(CORE_OBJ) \
	  -L$(ROOT) -L$(GLOG_LIB) -L$(PROTOBUF_LIB) -L$(GRPC_LIB) -L$(GFLAGS_LIB) \
	  -lglog -lprotobuf -lgrpc++ -lgrpc -lgpr -lupb -lgflags \
	  -o $(LIB_DIR)/libgraphlearn_shared.so
Is this a bug, or is there some special way to link gflags that isn't mentioned in the docs so far?
In the current implementation, the graph client and server are co-located with the TF worker and the TF parameter server.
When I want to use one TF worker to train and multiple workers to sample data simultaneously (for GPU training), there are restrictions under the current architecture. So, is there any plan to decouple the TF PS and the distributed graph engine to make the architecture more flexible?
CXX google/protobuf/text_format.lo
google/protobuf/text_format.cc: In member function ‘virtual void google::protobuf::TextFormat::FastFieldValuePrinter::PrintFloat(float, google::protobuf::TextFormat::BaseTextGenerator*) const’:
google/protobuf/text_format.cc:1623:27: error: ‘__builtin_isnan’ is not a member of ‘std’
generator->PrintString(!std::isnan(val) ? SimpleFtoa(val) : "nan");
^
google/protobuf/text_format.cc:1623:27: note: suggested alternative:
: note: ‘__builtin_isnan’
google/protobuf/text_format.cc: In member function ‘virtual void google::protobuf::TextFormat::FastFieldValuePrinter::PrintDouble(double, google::protobuf::TextFormat::BaseTextGenerator*) const’:
google/protobuf/text_format.cc:1627:27: error: ‘__builtin_isnan’ is not a member of ‘std’
generator->PrintString(!std::isnan(val) ? SimpleDtoa(val) : "nan");
^
google/protobuf/text_format.cc:1627:27: note: suggested alternative:
: note: ‘__builtin_isnan’
Makefile:4019: recipe for target 'google/protobuf/text_format.lo' failed
make[1]: *** [google/protobuf/text_format.lo] Error 1
make[1]: Leaving directory '/home/xxxx/Repo/graph-learn/third_party/protobuf/protobuf/src'
Makefile:1723: recipe for target 'install-recursive' failed
make: *** [install-recursive] Error 1
I don't see any demo of GCN or other graph algorithms trained on HDFS data that then pulls the node embeddings to local disk or HDFS for use. Would you add some? Thank you!
As we know from the paper, graph-learn has implemented four built-in graph partition algorithms to minimize the number of crossing edges whose endpoints lie on different workers, but some crossing edges may still exist after graph partitioning.
For a graph like the following (3-hop):
I want to know how graph-learn deals with this boundary problem, or whether it provides optimization methods to avoid this situation.
Dear developers,
I want to analyze the memory-access profile of graph-learn as a study for optimizing its performance.
Could you let me know how to dump the memory trace?
Thanks very much!
Kevin
In my project, my IDs are bank-card numbers 21 digits long, which overflow int64. Is there any way to solve this problem? Thanks!
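A workaround sketch, not a graph-learn feature: since a 21-digit number exceeds the int64 range, map each card string to a stable 8-byte hash and keep a reverse table if the original IDs must be recovered later. Collisions are possible but unlikely for moderate graph sizes.

import hashlib

# Map a long string ID to a deterministic int64 via the first 8 bytes of MD5.
def card_to_int64(card_id: str) -> int:
    digest = hashlib.md5(card_id.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

print(card_to_int64("123456789012345678901"))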
I parsed my dataset into formatted files like this:
node:
id:int64 attribute:string
3916 c
1819 c
4501 c
edge:
src_id:64 dst_id:int64 attribute:string
4 6 p
4 7 p
4 8 p
4 9 j
then I run commands:
g = g.node(source="data/graph_data/node0.txt", node_type="entry", decoder=gl.Decoder(attr_types=["string"]))
g = g.edge(source="data/graph_data/edge0.txt", edge_type = ("entry", "entry", "action"), decoder=gl.Decoder(attr_types=["string"]))
g = g.init()
When I tried to get a node's attributes, I found they are empty:
In [11]: res.__dict__
Out[11]:
{'_attred': False,
'_float_attrs': None,
'_graph': <graphlearn.python.graph.Graph at 0x7fc178e71d90>,
'_ids': array([ 4, 3916]),
'_int_attrs': None,
'_labels': None,
'_shape': (2,),
'_string_attrs': None,
'_type': 'entry',
'_weights': None}
Did I get any step wrong?
Hi there,
Recently, when I used dist_train.py (examples/tf/graphsage/dist_train.py) to test distributed mode, I found that the number of iterations differs considerably between workers (in my opinion, it should be roughly equal for each worker).
The PPI dataset contains 56,944 nodes and 818,717 edges. Suppose I set the batch size to 100; then there should be 570 iterations (node-based sampler) in a one-epoch training schedule.
When I use a 1-ps, 2-worker configuration for distributed training, worker-0 runs 284 iterations but worker-1 runs 856, and the total across the two workers is 1140 (2 × 570). The data appears to have been traversed twice in one epoch.
When I use a 2-ps, 1-worker configuration (an unusual setting, just for the experiment), worker-0 runs only 285 iterations (570 / 2). Similarly, with 4 ps and 2 workers, worker-0 runs 143 iterations and worker-1 runs 143. Half of the data is not used.
After reading some of the source code, I guess problem 1 may be caused by the shared state in node_getter.cc (graphlearn/core/operator/graph/node_getter.cc).
When a client sends a node-getter request, the NodeGetter op locks the DataStorage, so multiple requests will not read the same data (thread safety).
But when the cursor reaches the end of the data, the server raises an OutOfRangeError to that client and resets the cursor to 0. Other workers connected to the same server never receive the OutOfRangeError signal, so by the time they issue their next get-nodes request, the shared cursor has already been re-initialized to 0.
So, as in problem 1's result: when worker-0 reaches the end of the data, it receives an OutOfRangeError and the server resets the cursor to 0. Worker-0 then finishes its training process, while worker-1 restarts from 0 and traverses the whole dataset again (iterations 0 to 569).
When the number of workers is greater than the number of servers (ps), each server resets the shared state multiple times (once per connected worker), whereas the number of resets should really depend only on the epoch setting. This makes the actual number of iterations greater than expected.
When the number of workers is smaller than the number of servers (ps), the nodes are partitioned into n_server parts (by hash, for now), but only n_worker of those parts are used in training. Could this be because each client connects to only one fixed server?
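A toy model of the shared-cursor behavior described above (a paraphrase of my reading of node_getter.cc, not its real code), showing why only the worker whose request hits the end of the data ever sees the out-of-range signal:

# Toy reproduction of the shared-cursor effect; both workers draw batches
# from one server-side cursor, and the reset is visible to neither afterward.
class SharedNodeGetter:
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.cursor = 0  # shared by every worker connected to this server

    def get_nodes(self, batch_size):
        if self.cursor >= self.num_nodes:
            self.cursor = 0                    # reset happens server-side...
            raise LookupError("OutOfRange")    # ...but only this caller sees it
        start = self.cursor
        self.cursor = min(self.cursor + batch_size, self.num_nodes)
        return list(range(start, self.cursor))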
pos_logit uses sim_function; why does neg_logit always use tf.multiply?
pos_logit = sim_function(src_emb, pos_emb)
src_emb_exp = tf.tile(tf.expand_dims(src_emb, axis=1),
                      [1, per_sample_neg_num, 1])
src_emb_exp = tf.reshape(src_emb_exp, [-1, emb_dim])
neg_logit = tf.reduce_sum(tf.multiply(src_emb_exp, neg_emb), axis=-1)
true_xent = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=tf.ones_like(pos_logit), logits=pos_logit)
negative_xent = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=tf.zeros_like(neg_logit), logits=neg_logit)
loss = tf.reduce_mean(true_xent) + 1.0 * tf.reduce_mean(negative_xent)
logit = tf.concat([pos_logit, neg_logit], axis=-1)
label = tf.concat([tf.ones_like(pos_logit, dtype=tf.int32),
                   tf.zeros_like(neg_logit, dtype=tf.int32)], axis=-1)
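If sim_function is an inner product (my assumption here; it may be configurable), the two branches compute the same quantity and differ only in code shape:

import tensorflow as tf

# Equivalence check, assuming sim_function is a dot product:
# reduce_sum(multiply(a, b), axis=-1) is exactly the batched inner product.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])
dot = tf.reduce_sum(tf.multiply(a, b), axis=-1)  # values [17.0, 53.0]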
An example
g.V("user").with_attribute("gender=man").batch(64).outV("buy").sample(2).with_attribute("city=1").by("random")
If this is not resolved, the GPU will not be fully utilized in some situations.
I would appreciate better clarification of these issues; thanks a lot.
Hi, when I try to launch distributed training for GraphSage and check the output log, I find the following ERROR output after every iteration:
Epoch 38, Iteration 0, Time(s) 0.0830, Loss 0.86335
Epoch 38, Iteration 1, Time(s) 0.0877, Loss 0.59674
Epoch 38, Iteration 2, Time(s) 0.0899, Loss 0.54290
Epoch 38, Iteration 3, Time(s) 0.0685, Loss 0.71597
Epoch 38, Iteration 4, Time(s) 0.0743, Loss 0.84707
Epoch 38, Iteration 5, Time(s) 0.0781, Loss 0.49838
Epoch 38, Iteration 6, Time(s) 0.0681, Loss 0.77587
E0717 17:30:22.467953 589 notification.cc:194] RpcNotification:Failed req_type:GetNodes status:Out of range:No more nodes exist.
E0717 17:30:22.468039 589 distribute_runner.h:125] Rpc failed:Out of range:No more nodes exist.name:GetNodes
Currently, I am running the two (server + client) pairs on the same physical machine with different ports.
Could you please help me solve this problem?
Thanks!
Hi! First of all, thanks for releasing graph-learn!
Regarding the bipartite version of GraphSage, I am aware that you use the u2i.zip dataset, and I have successfully run the model on that dummy dataset without issues.
I do believe, though, that the u2i dataset does not include any distinct node features whatsoever, much less feature vectors of different lengths (depending on the node type, e.g. users or items).
Have you tested the model on a standard recommendation dataset, like MovieLens? If yes, does it work out of the box? I haven't really gotten to try it on that dataset myself; I'm just checking whether the model supports fully fledged node features, especially of different lengths.
Thanks in advance!
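For what it's worth, my understanding (an assumption to confirm with the maintainers) is that each node source gets its own Decoder, so users and items can declare feature vectors of different lengths:

import graphlearn as gl

# Sketch assuming per-source Decoders as in g.node(...); the file paths and
# feature counts here are hypothetical placeholders.
user_decoder = gl.Decoder(attr_types=["float"] * 8)    # e.g. 8 user features
item_decoder = gl.Decoder(attr_types=["float"] * 16)   # e.g. 16 item features
g = gl.Graph() \
    .node("data/user.txt", node_type="u", decoder=user_decoder) \
    .node("data/item.txt", node_type="i", decoder=item_decoder)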
After patching the fixes (#4, #11), I rebuilt/reinstalled graph-learn and used the commands below to start dist_train.py, but the problem persists.
PS_HOSTS="127.0.0.1:2300,127.0.0.1:2311"
WK_HOSTS="127.0.0.1:2200,127.0.0.1:2222"
TRACK_DIR="/tmp/graphlearn/"
rm -rf ${TRACK_DIR}
mkdir -p ${TRACK_DIR}
python dist_train.py \
--tracker=${TRACK_DIR} \
--ps_hosts=${PS_HOSTS} \
--worker_hosts=${WK_HOSTS} \
--job_name=ps \
--task_index=0 &
sleep 2
python dist_train.py \
--tracker=${TRACK_DIR} \
--ps_hosts=${PS_HOSTS} \
--worker_hosts=${WK_HOSTS} \
--job_name=worker \
--task_index=0 &
sleep 2
python dist_train.py \
--tracker=${TRACK_DIR} \
--ps_hosts=${PS_HOSTS} \
--worker_hosts=${WK_HOSTS} \
--job_name=ps \
--task_index=1 &
sleep 2
python dist_train.py \
--tracker=${TRACK_DIR} \
--ps_hosts=${PS_HOSTS} \
--worker_hosts=${WK_HOSTS} \
--job_name=worker \
--task_index=1 &
wait
Stdout & stderr:
stdout&stderr.txt
Server-Logs:
graphlearn.VM_10_224_centos.ced.log.WARNING.20200405-112725.21131.log
graphlearn.VM_10_224_centos.ced.log.WARNING.20200405-112721.21023.log
graphlearn.VM_10_224_centos.ced.log.INFO.20200405-112721.21023.log
graphlearn.VM_10_224_centos.ced.log.INFO.20200405-112725.21131.log
https://github.com/alibaba/graph-learn/blob/master/docs/query.md#inneg
inNeg(edge_type). Nodes to Nodes. Traverse to the source negative node along the edge. The edge must be undirected.
For example, if the topology is nodeA--(edge)-->nodeB, the doc says nodeB.outNeg(edge) is nodeA.
Should it be "nodeB.inNeg(edge) is nodeA"?