alibaba / graph-learn

An Industrial Graph Neural Network Framework

License: Apache License 2.0

gnn aligraph graphlearn tensorflow pytorch graph graph-neural-networks gnn-framework dynamic-graph training

graph-learn's Introduction


News

Our GNN acceleration library for PyTorch is now available. https://github.com/alibaba/graphlearn-for-pytorch

Documentation

简体中文 | English

Graph-Learn (formerly AliGraph) is a distributed framework designed for the development and application of large-scale graph neural networks. It has been successfully applied to many scenarios within Alibaba, such as search recommendation, network security, and knowledge graphs. Since Graph-Learn 1.0, we have added an online inference service to the framework, providing a complete training-and-inference solution for applying GNNs in real business.

  • GraphLearn-Training

The training framework. It supports sampling on batch graphs and training GNN models offline or incrementally.

    It provides both Python and C++ interfaces for graph sampling operations, including a Gremlin-like GSL (Graph Sampling Language) interface (see the query sketch after this list). For GNN models, Graph-Learn provides a set of paradigms and processes for model development. It is compatible with both TensorFlow and PyTorch, and provides data-layer and model-layer interfaces along with rich model examples.

    Detail

  • Dynamic-Graph-Service

    An online inference service. It supports real-time sampling on dynamic graphs with streaming graph updates.

    It guarantees a P99 sampling latency within 20 ms on large-scale dynamic graphs. The client side of the online inference service provides Java GSL interfaces and TensorFlow model prediction.

    Detail
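For a flavor of the GSL interface described above, here is a minimal sketch of a 2-hop sampling query. The file paths, node/edge types, and attribute schema are hypothetical, and exact method names may differ between graph-learn versions.

import graphlearn as gl

# Hypothetical item-item graph; paths and type names are placeholders.
g = gl.Graph() \
    .node(source="data/item_nodes.txt", node_type="item",
          decoder=gl.Decoder(attr_types=["float"] * 16)) \
    .edge(source="data/i2i_edges.txt", edge_type=("item", "item", "i2i"),
          decoder=gl.Decoder(weighted=True)) \
    .init()

# Batch 64 seed items, sample 10 neighbors by edge weight, then 15 more at random.
query = g.V("item").batch(64) \
         .outV("i2i").sample(10).by("edge_weight") \
         .outV("i2i").sample(15).by("random") \
         .values()
samples = g.run(query)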

Use GraphLearn-Training and Dynamic-Graph-Service for training and inference.

(figure: overview of the end-to-end training and inference workflow)

  1. A user initiates a request on the Web (0), samples in real time on the dynamic graph via the client side (1), uses the samples as model input, and requests prediction results from the Model service (3).
  2. The prediction results, feedback, and some context from the Web are sent to the Data Hub (0, 3), e.g. a Log Service.
  3. Data updates flow as a stream into the Dynamic Graph Service as graph updates (4).
  4. GraphLearn-Training loads a window of graph data every hour, incrementally trains the model, and updates the model on the TensorFlow Model service.

Citation

Please cite the following paper in your publications if Graph-Learn helps your research.

@article{zhu2019aligraph,
  title={AliGraph: a comprehensive graph neural network platform},
  author={Zhu, Rong and Zhao, Kun and Yang, Hongxia and Lin, Wei and Zhou, Chang and Ai, Baole and Li, Yong and Zhou, Jingren},
  journal={Proceedings of the VLDB Endowment},
  volume={12},
  number={12},
  pages={2094--2105},
  year={2019},
  publisher={VLDB Endowment}
}

License

Apache License 2.0.


graph-learn's Issues

Questions about graphsage dist_train.py

First of all, thank you for open-sourcing such an amazing project.

I tried to follow this manual to play with distributed training on a single machine, but failed to start the training process.

Here is my script to start the ps and worker processes.

PS_HOSTS="127.0.0.1:2300,127.0.0.1:2311"
WK_HOSTS="127.0.0.1:2200,127.0.0.1:2222"

python dist_train.py \
  --tracker=./distributed \
  --ps_hosts=${PS_HOSTS} \
  --worker_hosts=${WK_HOSTS} \
  --job_name=ps \
  --task_index=0 &

python dist_train.py \
  --tracker=./distributed \
  --ps_hosts=${PS_HOSTS} \
  --worker_hosts=${WK_HOSTS} \
  --job_name=worker \
  --task_index=0 &

python dist_train.py \
  --tracker=./distributed \
  --ps_hosts=${PS_HOSTS} \
  --worker_hosts=${WK_HOSTS} \
  --job_name=ps \
  --task_index=1 &

python dist_train.py \
  --tracker=./distributed \
  --ps_hosts=${PS_HOSTS} \
  --worker_hosts=${WK_HOSTS} \
  --job_name=worker \
  --task_index=1 &

wait

I also added some logging to the Graph.init() function (https://github.com/alibaba/graph-learn/blob/master/graphlearn/python/graph.py), but I never see "############# Server init done #############" printed out.

    if job_name == "client":
      pywrap.set_client_id(task_index)
      self._client = pywrap.rpc_client()
      self._server = None
    else:
      print("############# Server init start #############")
      if job_name == "server":
        self._client = None
      if not tracker and kwargs.get("tracker"):
        tracker = kwargs["tracker"]
      if tracker:
        self._server = Server(task_index, server_count, tracker)
      else:
        self._server = Server(task_index, server_count)
      self._server.start()
      print("############# Server start done #############")
      self._server.init(self._edge_sources, self._node_sources)
      print("############# Server init done #############")
    return self

Everything I get is listed below; it keeps printing Invalid endpoint file: 0 forever.

main                                                                                                                                        
WARNING: Logging before InitGoogleLogging() is written to STDERR                                                                            
I0402 13:10:49.755939 10816 naming_engine.cc:56] Connect naming engine ok: ./distributed/endpoints/
I0402 13:10:49.756223 10816 channel_manager.cc:94] Auto select server: 1  
W0402 13:10:49.756240 10816 channel_manager.cc:100] Waiting for all servers started: 0/2
W0402 13:10:49.756494 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:49.756530 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:49.756541 10904 naming_engine.cc:159] Refresh endpoints count: 0
2020-04-02 13:10:49.771019: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA                                                                                                            
2020-04-02 13:10:49.777325: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:2300, 1 -> 127.0.0.1:2311}                                                                                                          
2020-04-02 13:10:49.777366: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:2200, 1 -> localhost:2222}                                                                                                      
2020-04-02 13:10:49.784454: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:2222                                                                                                                                          
main                                                                                                                                        
WARNING: Logging before InitGoogleLogging() is written to STDERR                                                                            
I0402 13:10:49.878661 10814 naming_engine.cc:56] Connect naming engine ok: ./distributed/endpoints/
I0402 13:10:49.878902 10814 channel_manager.cc:94] Auto select server: 0  
W0402 13:10:49.878921 10814 channel_manager.cc:100] Waiting for all servers started: 0/2
W0402 13:10:49.880380 10951 naming_engine.cc:154] Invalid endpoint file: 0
main                                                                                                                                        
W0402 13:10:49.880429 10951 naming_engine.cc:154] Invalid endpoint file: 1  
I0402 13:10:49.880441 10951 naming_engine.cc:159] Refresh endpoints count: 0
############# Server init start #############                                                                                               
2020-04-02 13:10:49.894944: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA                                                                                                            
2020-04-02 13:10:49.900562: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:2300, 1 -> 127.0.0.1:2311}                                                                                                          
2020-04-02 13:10:49.900591: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2200, 1 -> 127.0.0.1:2222}                                                                                                      
2020-04-02 13:10:49.901519: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:2200                                                                                                                                          
main                                                                                                                                        
############# Server init start #############                                                                                               
W0402 13:10:50.756636 10904 naming_engine.cc:154] Invalid endpoint file: 0  
W0402 13:10:50.756687 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:50.756696 10904 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:50.880582 10951 naming_engine.cc:154] Invalid endpoint file: 0  
W0402 13:10:50.880635 10951 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:50.880697 10951 naming_engine.cc:159] Refresh endpoints count: 0
[2020-04-02 13:10:50.888773] Server started.                                                                                                
############# Server start done #############                                                                                                                                                                                                                                            
[2020-04-02 13:10:50.985136] Server started.                                                                                                                                                                                                                                             
############# Server start done #############                                                                                                                                                                                                                                            
W0402 13:10:51.756803 10904 naming_engine.cc:154] Invalid endpoint file: 0                                                                                                                                                                                                               
W0402 13:10:51.756860 10904 naming_engine.cc:154] Invalid endpoint file: 1                                                                  
I0402 13:10:51.756868 10904 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:51.880851 10951 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:51.880900 10951 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:51.880908 10951 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:52.756978 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:52.757043 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:52.757053 10904 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:52.881058 10951 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:52.881108 10951 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:52.881115 10951 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:53.757174 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:53.757233 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:53.757244 10904 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:53.881242 10951 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:53.881289 10951 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:53.881297 10951 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:54.757366 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:54.757421 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:54.757429 10904 naming_engine.cc:159] Refresh endpoints count: 0

Any clue? Thank you!

Cannot read attributes from the data file.

I parsed my dataset into formatted files like this:
node:

id:int64	attribute:string
3916	c
1819	c
4501	c

edge:

src_id:64	dst_id:int64	attribute:string
4	6	p
4	7	p
4	8	p
4	9	j

Then I run these commands:

g = g.node(source="data/graph_data/node0.txt", node_type="entry", decoder=gl.Decoder(attr_types=["string"]))
g = g.edge(source="data/graph_data/edge0.txt", edge_type = ("entry", "entry", "action"), decoder=gl.Decoder(attr_types=["string"]))
g = g.init()

When I tried to get the node's attributes, I found they were empty:

In [11]: res.__dict__
Out[11]: 
{'_attred': False,
 '_float_attrs': None,
 '_graph': <graphlearn.python.graph.Graph at 0x7fc178e71d90>,
 '_ids': array([   4, 3916]),
 '_int_attrs': None,
 '_labels': None,
 '_shape': (2,),
 '_string_attrs': None,
 '_type': 'entry',
 '_weights': None}

Did I get any step wrong?

installation on CentOS 7

I got an error when I tried to install from the wheel file on CentOS 7. Do you have any plan to support CentOS?

ERROR: graphlearn-0.1-cp27-cp27mu-linux_x86_64.whl is not a supported wheel on this platform.

How to properly link the gflags?

Building the Python wheel from the master branch and running the GCN model gives:

python train_supervised.py
Traceback (most recent call last):
  File "train_supervised.py", line 22, in <module>
    import graphlearn as gl
  File "/home/cn/research/graph/venv/lib/python2.7/site-packages/graphlearn/__init__.py", line 16, in <module>
    from graphlearn import pywrap_graphlearn as pywrap
ImportError: /home/cn/research/graph/graph-learn/built/lib/libgraphlearn_shared.so: undefined symbol: _ZN6google14FlagRegistererC1IiEEPKcS3_S3_PT_S5_

It seems that the symbol _ZN6google14FlagRegistererC1IiEEPKcS3_S3_PT_S5_ (which is the mangled FlagRegisterer from the gflags package) was not found.

So I modified the Makefile to link the gflags library manually, and the problem was solved (modified line: here).

so:protobuf grpc glog gtest proto common platform service core
	@mkdir -p $(INCLUDE_DIR)
	@mkdir -p $(LIB_DIR)
	@mkdir -p $(BIN_DIR)
	$(CXX) $(CXXFLAGS) -shared $(PROTO_OBJ) $(COMMON_OBJ) $(PLATFORM_OBJ) $(SERVICE_OBJ) $(CORE_OBJ) \
		-L$(ROOT) -L$(GLOG_LIB) -L$(PROTOBUF_LIB) -L$(GRPC_LIB) -L$(GFLAGS_LIB)\
		-lglog -lprotobuf -lgrpc++ -lgrpc -lgpr -lupb -lgflags\
		-o $(LIB_DIR)/libgraphlearn_shared.so

Is this a bug, or is there a special way to link gflags that is not mentioned in the docs so far?

Installation on Ubuntu 18.04

I tried to install graph-learn from source. When I run make python, I get the following error.
My Python version is 3.7.4.

python /home/kanon/code/graph-learn/setup/setup.py bdist_wheel
/home/kanon/anaconda3/lib/python3.7/site-packages/setuptools/dist.py:462: UserWarning: The version specified (b'0.1') is an invalid version, this may not work as expected with newer versions of setuptools, pip, and PyPI. Please see PEP 440 for more details.
  "details." % self.metadata.version
running bdist_wheel
Traceback (most recent call last):
  File "/home/kanon/code/graph-learn/setup/setup.py", line 85, in <module>
    package_data={'': ['python/lib/lib*.so*']},
  File "/home/kanon/anaconda3/lib/python3.7/site-packages/setuptools/__init__.py", line 144, in setup
    return distutils.core.setup(**attrs)
  File "/home/kanon/anaconda3/lib/python3.7/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/home/kanon/anaconda3/lib/python3.7/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/home/kanon/anaconda3/lib/python3.7/distutils/dist.py", line 984, in run_command
    cmd_obj.ensure_finalized()
  File "/home/kanon/anaconda3/lib/python3.7/distutils/cmd.py", line 107, in ensure_finalized
    self.finalize_options()
  File "/home/kanon/anaconda3/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 129, in finalize_options
    self.data_dir = self.wheel_dist_name + '.data'
  File "/home/kanon/anaconda3/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 164, in wheel_dist_name
    safer_version(self.distribution.get_version()))
  File "/home/kanon/anaconda3/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 43, in safer_version
    return safe_version(version).replace('-', '_')
  File "/home/kanon/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1333, in safe_version
    return str(packaging.version.Version(version))
  File "/home/kanon/anaconda3/lib/python3.7/site-packages/pkg_resources/_vendor/packaging/version.py", line 200, in __init__
    match = self._regex.search(version)
TypeError: cannot use a string pattern on a bytes-like object
Makefile:317: recipe for target 'python' failed
make: *** [python] Error 1

Must the id be int64?

In my project, my ID is a bank card number that is 21 digits long. Is there any way to handle this? Thanks!
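This question is left as asked; purely as an illustration, below is a hypothetical preprocessing sketch (not part of graph-learn) that maps long string IDs to int64 keys before writing the node/edge files. Hash collisions are unlikely but should be checked when building the mapping table.

import hashlib
import struct

def to_int64(card_id):
    # Hash the card number and fold the first 8 bytes into a signed int64.
    digest = hashlib.sha1(card_id.encode("utf-8")).digest()
    return struct.unpack(">q", digest[:8])[0]

# Keep a reverse map so results can be joined back to the original card numbers.
id_map = {}
card = "123456789012345678901"   # 21-digit example
key = to_int64(card)
id_map[key] = card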

Why use `sim_function` for positive samples but `tf.multiply` for negative ones in the loss function?

pos_logit uses sim_function, so why does neg_logit always use tf.multiply?
pos_logit = sim_function(src_emb, pos_emb)

src_emb_exp = tf.tile(tf.expand_dims(src_emb, axis=1),
                      [1, per_sample_neg_num, 1])
src_emb_exp = tf.reshape(src_emb_exp, [-1, emb_dim])
neg_logit = tf.reduce_sum(tf.multiply(src_emb_exp, neg_emb), axis=-1)
true_xent = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=tf.ones_like(pos_logit), logits=pos_logit)
negative_xent = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=tf.zeros_like(neg_logit), logits=neg_logit)

loss = tf.reduce_mean(true_xent) + 1.0 * tf.reduce_mean(negative_xent)
logit = tf.concat([pos_logit, neg_logit], axis=-1)
label = tf.concat([tf.ones_like(pos_logit, dtype=tf.int32),
                   tf.zeros_like(neg_logit, dtype=tf.int32)], axis=-1)
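For reference, a small toy check (my own illustration, not the maintainers' answer) that tf.reduce_sum(tf.multiply(a, b), axis=-1) computes a row-wise inner product; whether that matches sim_function depends on which similarity the model is configured with.

import tensorflow as tf

# Toy check: reduce_sum(multiply(a, b), axis=-1) is the row-wise dot product.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])
dot = tf.reduce_sum(tf.multiply(a, b), axis=-1)   # expected [17., 53.]

with tf.Session() as sess:   # TF 1.x session style, matching the snippet above
    print(sess.run(dot))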

How to deal with the boundary problem

As we know from the paper, graph-learn has implemented four built-in graph partition algorithms to minimize the number of crossing edges whose endpoints are on different workers, but some crossing edges may still exist after graph partitioning.

For the graph shown below (3-hop):

(figure: example graph partitioned across server 1 and server 2)

  • hop-1: get vertex 1's neighbor (vertex 3) from the adjacency matrix
  • hop-2: sampling for vertex 3 requires sending a request from server 1 to server 2, which returns vertex 3's neighbor (vertex 2) from its adjacency matrix
  • hop-3: sampling for vertex 2 requires sending a request from server 2 back to server 1

I want to know how graph-learn deals with this boundary problem, or whether it provides optimization methods to avoid this situation.

Build error with protobuf branch 3.10.x

CXX google/protobuf/text_format.lo
google/protobuf/text_format.cc: In member function ‘virtual void google::protobuf::TextFormat::FastFieldValuePrinter::PrintFloat(float, google::protobuf::TextFormat::BaseTextGenerator*) const’:
google/protobuf/text_format.cc:1623:27: error: ‘__builtin_isnan’ is not a member of ‘std’
generator->PrintString(!std::isnan(val) ? SimpleFtoa(val) : "nan");
^
google/protobuf/text_format.cc:1623:27: note: suggested alternative:
: note: ‘__builtin_isnan’
google/protobuf/text_format.cc: In member function ‘virtual void google::protobuf::TextFormat::FastFieldValuePrinter::PrintDouble(double, google::protobuf::TextFormat::BaseTextGenerator*) const’:
google/protobuf/text_format.cc:1627:27: error: ‘__builtin_isnan’ is not a member of ‘std’
generator->PrintString(!std::isnan(val) ? SimpleDtoa(val) : "nan");
^
google/protobuf/text_format.cc:1627:27: note: suggested alternative:
: note: ‘__builtin_isnan’
Makefile:4019: recipe for target 'google/protobuf/text_format.lo' failed
make[1]: *** [google/protobuf/text_format.lo] Error 1
make[1]: Leaving directory '/home/xxxx/Repo/graph-learn/third_party/protobuf/protobuf/src'
Makefile:1723: recipe for target 'install-recursive' failed
make: *** [install-recursive] Error 1

Some questions about dynamic threadpool

In the dynamic_worker_threadpool.cc file, I think WaitForNotify() is used in several scenarios:
1. Between the push and pop of idle_threads_stack: add_task() first pops the stack, which makes pinfo != info, and set() is called before wait(), so the thread does not need to wait.
2. When pinfo == info, the thread automatically loops for tasks.
3. When pinfo != info, set() wakes another thread so it can continue working (via the condition signal).
So I am confused by 2 questions:
1. Before shutdown(), it seems that at least one thread must be active at any time, because each thread is always woken from set(), which sets is_set to true; so when can WaitForIdle() return?
2. Shutdown() sets stopped_ to false to make the active threads break out of the loop and complete. Why does one last thread need to be kept waiting for the event_for_all_workers_exit_ signal? I think that when thread_num_ decreases to zero, event_for_all_workers_exit_'s wait() could return directly.

How to dump memory trace?

Dear developers,

I want to analyze the memory access profile of graph-learn as a study for optimizing its performance.

Could you let me know how to dump the memory trace?

Thanks very much!

Kevin

differences with alibaba/euler?

Hi,

Thanks for developing this open-source project. I noticed that Alibaba also open-sourced euler; can anyone point out the differences between the two?

Any plan to decouple TF-PS and distributed graph engine?

In the current implementation, the graph client and server are co-located with the TF worker and TF parameter server.

I want to use one TF worker to train and multiple workers to sample data simultaneously (for GPU training), but there are some restrictions under the current architecture. So, is there any plan to decouple the TF PS and the distributed graph engine to make the architecture more flexible?

tensorflow import error

CentOS 7
python 2.7.5
tensorflow 1.12.0

I have built GL from source successfully and passed ./test_cpp_ut.sh. But when I try ./test_python_ut.sh, I get stuck on import errors.

./graphlearn/python/tests/test_node_weighted.py
Traceback (most recent call last):
  File "./graphlearn/python/tests/test_node_weighted.py", line 23, in <module>
    import graphlearn as gl
  File "/usr/lib64/python2.7/site-packages/graphlearn/__init__.py", line 33, in <module>
    from graphlearn.python.model.tf import aggregators
  File "/usr/lib64/python2.7/site-packages/graphlearn/python/model/tf/aggregators/__init__.py", line 20, in <module>
    from graphlearn.python.model.tf.aggregators.gcn_aggregator import GCNAggregator
  File "/usr/lib64/python2.7/site-packages/graphlearn/python/model/tf/aggregators/gcn_aggregator.py", line 20, in <module>
    import tensorflow as tf
  File "/usr/lib/python2.7/site-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/usr/lib/python2.7/site-packages/tensorflow/python/__init__.py", line 88, in <module>
    from tensorflow.python import keras
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/__init__.py", line 24, in <module>
    from tensorflow.python.keras import activations
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/activations/__init__.py", line 22, in <module>
    from tensorflow.python.keras._impl.keras.activations import elu
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/_impl/keras/__init__.py", line 21, in <module>
    from tensorflow.python.keras._impl.keras import activations
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/_impl/keras/activations.py", line 23, in <module>
    from tensorflow.python.keras._impl.keras import backend as K
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/_impl/keras/backend.py", line 38, in <module>
    from tensorflow.python.layers import base as tf_base_layers
  File "/usr/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 25, in <module>
    from tensorflow.python.keras.engine import base_layer
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/engine/__init__.py", line 23, in <module>
    from tensorflow.python.keras.engine.base_layer import InputSpec
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 35, in <module>
    from tensorflow.python.keras import backend
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/backend/__init__.py", line 22, in <module>
    from tensorflow.python.keras._impl.keras.backend import abs
ImportError: cannot import name abs

I believe the error comes from an inconsistency between TF and its dependencies, according to these threads. I have tried many combinations of the packages but still failed.

Could you provide the detailed versions of TF's dependencies?

graphsage train_unsupervised.py gets killed while saving embeddings

Dear developpers,

When I run the graphsage PPI training example, the script gets killed while saving embeddings:

Epoch 00, Iteration 110, Time(s) 2.8076, Loss 1.26855
Epoch 00, Iteration 111, Time(s) 0.6244, Loss 1.25834
save embedding...
(a few minutes later)
Killed

I confirm that ./id_emb exists at the right path and that there is 18 GB of space available on the disk.

Could you kindly let me know what the possible reason is and how to fix this issue?

Thanks very much!

Kevin

Does graph-learn prepare all batch sampling results before training?

Does graph-learn prepare all batch sampling results before training?
For example, if I set up 1000 training iterations, is graph-learn's strategy to prepare the data for 1000 batches in advance, and then use the next method to get each batch?
In distributed mode, I first need to randomly select batch_size IDs as source vertices,
and then sample further starting from these source vertices.
Are these source vertices selected from the subgraph on each machine or from the whole graph?
If they are selected from the subgraph, communication is reduced, but the model quality may also be reduced.

Thanks!

Why does LINE's loss function use KL-loss alone?

Sigmoid cross-entropy loss is used for the unsupervised models, while KL-loss is used for LINE.
But according to the formula cross_entropy - entropy = KL, the entropy is fixed for the entire training set, so using cross_entropy is the same as using KL.
(Of course, for each specific batch, it changes.)
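For clarity, the identity being invoked is the standard decomposition (nothing LINE-specific):

H(p, q) \;=\; -\sum_x p(x)\log q(x) \;=\; H(p) + D_{\mathrm{KL}}(p \,\|\, q),
\qquad
\arg\min_q H(p, q) \;=\; \arg\min_q D_{\mathrm{KL}}(p \,\|\, q),

since H(p) does not depend on the model q.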

I don't quite understand why LINE uses KL-loss; can you help?

@archwalker @baoleai
Thanks

ResourceExhaustedError in distributed mode

If I change neighs_num to [25, 10] and batch_size to 100 in this file:
https://github.com/alibaba/graph-learn/blob/master/examples/tf/graphsage/dist_train.py

I see a ResourceExhaustedError, and it seems related to the graph engine operations. I am not seeing any CPU or RAM bottleneck.

I run 2 parameter servers and 2 workers as shown in the wiki.

The error is:

23:56:40.288581 61070 notification.cc:194] RpcNotification:Failed req_type:LookupNodes status:Resource exhausted:Received message larger than max (8066392 vs. 4194304)
23:56:40.322116 61070 distribute_runner.h:125] Rpc failed:Resource exhausted:Received message larger than max (8066392 vs. 4194304)name:LookupNodes

Any idea how to fix this?

How to include textual data as node attributes

Say a node has multiple aspects of textual description; one way is to store them as multiple attributes for the node, separated by a delimiter such as a colon. E.g.
id:int64 attribute:string
10001 the color is blue:round shape:it's very nice and expensive

However, if the text itself contains a colon, the split would break. What's the best way to input multiple text attributes to graph-learn? Separating them by "\t" in a line would break the code. What about putting the text attributes into multiple node files (one attribute per file)? Would that be supported?

I understand text needs to be further encoded by custom encoders, which I plan to implement.
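Not an official answer, but a sketch of the two directions mentioned above. It assumes gl.Decoder accepts an attr_delimiter argument; if the delimiter can occur inside the text, escape it during preprocessing and undo that inside the custom text encoder.

import graphlearn as gl

# Option 1: three colon-separated string attributes per node (assumes attr_delimiter exists).
decoder = gl.Decoder(attr_types=["string", "string", "string"], attr_delimiter=":")
g = gl.Graph().node(source="data/item_nodes.txt", node_type="item", decoder=decoder).init()

# Option 2: escape the delimiter in the raw text before writing the node file.
def escape(text, delim=":"):
    return text.replace(delim, "<colon>")   # undo this in the custom text encoder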

[Question] Implementing my own embedding in AliGraph

Hello AliGraph people!

I read the VLDB 2019 paper and I found it very interesting. It is great that the project is open source.

I am wondering if I can implement my own embedding algorithm on top of your system. For example, can I implement DeepWalk in AliGraph?

If yes, do you provide abstractions for defining random-walk operations in AliGraph?
If no, is it possible to build such abstractions in your system?

Thanks in advance.

Best,
Makis

Synchronous Training

Great work!
I wonder how graph-learn does synchronous training.
It would be great if there were a distributed synchronous training example.

Confusion about your DeepWalk implementation

Hi, I am exploring your DeepWalk implementation, but I am a little confused by the gen_pair function in _positive_sample in your implementation.

Specifically, why does gl.gen_pair slide the window across paths, rather than inside each path?

  >>> path = [np.array([1, 2]), np.array([3, 4]), np.array([5, 6])]
  >>> left_window_size = right_window_size = 1
  >>> src_ids, dst_ids = gen_pair(path, left_window_size, right_window_size)
  >>> print(src_ids, dst_ids)
  >>> (array([1, 2, 3, 4, 3, 4, 5, 6]), array([3, 4, 1, 2, 5, 6, 3, 4]))

The example above starts from 3 nodes {1, 3, 5}, each with a random walk of length 2, which finally collects 3 paths: p1 = [1, 2], p2 = [3, 4], and p3 = [5, 6].

I thought the original DeepWalk paper applies SkipGram to each of these paths, so we should apply SkipGram to p1, p2, and p3 separately. But the implementation seems to apply SkipGram to the whole set of paths; for example, it pushes all nodes in the paths into pair[0] or pair[1].

Why can we do that instead of the original algorithm in the paper? Or am I misunderstanding anything?

Thanks.
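For contrast, here is a small NumPy sketch (not the graph-learn API) of the per-path windowing described in the original DeepWalk paper, where pairs never cross walks.

import numpy as np

def per_path_pairs(path, left, right):
    # Skip-gram pairs generated inside one walk, as in the original DeepWalk.
    src, dst = [], []
    for i, center in enumerate(path):
        lo, hi = max(0, i - left), min(len(path), i + right + 1)
        for j in range(lo, hi):
            if j != i:
                src.append(center)
                dst.append(path[j])
    return np.array(src), np.array(dst)

# Three length-2 walks, window size 1 on each side; pairs stay inside each walk.
for walk in [[1, 2], [3, 4], [5, 6]]:
    print(per_path_pairs(walk, 1, 1))
# (array([1, 2]), array([2, 1])), (array([3, 4]), array([4, 3])), (array([5, 6]), array([6, 5]))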

Data partition issue in distributed training

Hi there,

Recently, when I used dist_train.py (examples/tf/graphsage/dist_train.py) to test distributed mode, I found that the number of iterations of each worker differs considerably (in my opinion, it should be roughly equal for each worker).

Problem Detail:

The PPI dataset contains 56,944 nodes and 818,717 edges. Suppose I set the batch size to 100; then there should be 570 iterations (node-based sampler) in a one-epoch training schedule.

  1. When I use a 1-ps, 2-worker configuration for distributed training, worker-0 runs 284 iterations but worker-1 runs 856, and the sum over the two workers is 1140 (2 * 570). The data appears to have been traversed twice in one epoch.

  2. When I use a 2-ps, 1-worker configuration (an unusual setting, just for the experiment), worker-0 only runs 285 iterations (570 / 2). Similarly, when I use 4 ps with 2 workers, worker-0 runs 143 iterations and worker-1 runs 143. Half of the data is not used.


Some thoughts

After reading some source code, I guess problem 1 may be caused by the shared state in node_getter.cc (graphlearn/core/operator/graph/node_getter.cc).

When a client sends a node-getter request, the NodeGetter op locks the DataStorage, so multiple requests will not read the same data (thread safety).

But when the cursor reaches the end of the data, it raises an OutOfRangeError to the client and resets the cursor to 0. The other workers connected to the same server do not receive the OutOfRangeError signal, so when they build a new request to get nodes, the shared cursor has already been re-initialized to 0. So, as in problem 1's result, when worker-0 reaches the end of the data, it receives an OutOfRangeError and the server resets the cursor to 0; worker-0 then finishes its training process, while worker-1 restarts from 0 to 569 and traverses the whole dataset again (see the toy sketch after the summary).


Summary:

  1. When the number of workers is greater than the number of servers (ps), a server will reset the shared state multiple times (once per worker connected to it). In fact, the number of resets should only depend on the epoch setting. As a result, the actual number of iterations is greater than expected.

  2. When the number of workers is less than the number of servers (ps), the nodes are partitioned into n_server parts (by hash, for now), but only as many parts as there are workers are used in training. Perhaps this is because each client connects to only one fixed server?
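To make the suspected shared-cursor behavior concrete, here is a toy simulation (plain Python, not graph-learn code) of one server serving two alternating workers from a single cursor over 570 batches; the numbers come out close to the observations in problem 1.

# Toy model of a server-side node cursor shared by two workers.
class SharedCursorServer(object):
    def __init__(self, num_batches):
        self.num_batches = num_batches
        self.cursor = 0

    def get_batch(self):
        if self.cursor >= self.num_batches:
            self.cursor = 0              # reset on OutOfRange
            raise StopIteration          # only the requesting worker sees this
        self.cursor += 1
        return self.cursor - 1

server = SharedCursorServer(num_batches=570)
iters = {"worker-0": 0, "worker-1": 0}
active = ["worker-0", "worker-1"]
turn = 0
while active:
    w = active[turn % len(active)]
    try:
        server.get_batch()
        iters[w] += 1
    except StopIteration:
        active.remove(w)                 # this worker stops; the other keeps going
    turn += 1
print(iters)   # about {'worker-0': 285, 'worker-1': 855}: the data is traversed twice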

Worker memory usage keeps increasing when running graphsage dist_train.py

Problem description

When I run the graphsage dist_train.py (Cora data), the worker memory usage keeps increasing:

(figure: worker memory usage over time on the Cora data)

When I train the model with our own data, which is a larger graph, the memory usage grows even faster:

(figure: worker memory usage over time on our own, larger graph)

I suspect there may be a memory leak; maybe some objects from previous iterations are not freed? Any advice or suggestions will be greatly appreciated.

Environment information for cora data

docker image: registry.cn-zhangjiakou.aliyuncs.com/pai-image/graph-learn:v0.1-cpu

code path: /workspace/graph-learn/examples/tf/graphsage (in docker container)

config: 2ps, 2worker / batchsize: 32 / epoch: 40000000

Is python3 supported?

I saw the sentence: "Otherwise, please refer to the section 'build from source'." But I have never been able to install it successfully. My environment is Python 3 on CentOS (Alibaba Finance Cloud). Running the following commands keeps failing; after downloading the master package from the web page and copying it over, the git and make steps still do not work.
git clone https://github.com/alibaba/graph-learn.git
cd graph-learn
git submodule update --init
make test
make python

GraphSAGE doesn't work with 1 hop.

When setting hops_num to 1 and neighs_num to [25] in examples/tf/graphsage/train_supervised.py,
it raises an AssertionError:

assert self._depth + 1 == len(feature_encoders)

Standard Recommender Dataset on Bipartite GraphSage

Hi! First of all, thanks for releasing graph-learn!

Regarding the bipartite version of GraphSAGE, I am aware that you use the u2i.zip dataset, and I have successfully run the model on that dummy dataset without issues.

I do believe, though, that the u2i dataset does not include any distinct node features whatsoever, much less feature vectors of different lengths (depending on the node type, e.g. users or items).

Have you tested the model on a standard recommendation dataset, like MovieLens? If yes, does it work out of the box? I haven't really gotten around to trying it on that dataset myself; I am just checking whether the model indeed supports fully fledged node features, especially of different lengths.

Thanks in advance!

Servers hang when changing inter_thread_num

Hi
When I test performance using the graph-learn framework and set inter_thread_num to 64 or greater using gl.set_inter_threadnum(64), all the servers hang while initializing the graph data and the workers keep waiting for the servers to be ready.

Save embeddings issue in distributed training

Hi,

I run dist_train.py (examples/tf/graphsage/dist_train.py) and it works well. However, when I try to save the embeddings after training, it raises RuntimeError("Graph is finalized and cannot be modified."). I met the same issue when I tried to run Bipartite GraphSAGE in distributed mode.

  Traceback (most recent call last):
  File "dist_train.py", line 132, in <module>
    main()
  File "dist_train.py", line 128, in main
    train(config, g)
  File "dist_train.py", line 81, in train
    u_embs = trainer.get_node_embedding("u")
  File "/usr/local/lib/python2.7/dist-packages/graphlearn/python/model/tf/trainer.py", line 57, in get_node_embedding
    ids, emb, iterator = self.model.node_embedding(node_type)

Wish clarification about two optimization strategies.

  • How can DNN computation on the GPU be balanced against sampling computation on the CPU in graph-learn, when the GPU is fast and the data produced by CPU sampling is not produced fast enough? Generally, we would use latency-hiding techniques to prefetch and buffer the samples produced by the CPU sampler (see the sketch below).

If this is not addressed, the GPU will not be fully utilized in some situations.

I would appreciate a better clarification of these issues, thanks a lot.
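Not a graph-learn feature, just a generic latency-hiding sketch: wrap the CPU-side sampler in a tf.data generator and prefetch a few batches so sampling overlaps with the GPU training step. The shapes, batch size, and the fake sampler below are made up for illustration.

import numpy as np
import tensorflow as tf

def sample_batches():
    # Stand-in for a CPU-side graph sampler; yields (features, labels).
    while True:
        feats = np.random.rand(512, 128).astype(np.float32)
        labels = np.random.randint(0, 2, size=(512,)).astype(np.int64)
        yield feats, labels

dataset = tf.data.Dataset.from_generator(
    sample_batches,
    output_types=(tf.float32, tf.int64),
    output_shapes=((512, 128), (512,)))
dataset = dataset.prefetch(buffer_size=4)   # keep a few sampled batches ready

feats, labels = dataset.make_one_shot_iterator().get_next()   # TF 1.x style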

Still cannot run graphsage dist_train locally #4

After patching the fixes (#4, #11), I rebuilt and reinstalled graph-learn and used the commands below to start dist_train.py, but the problem remains.

PS_HOSTS="127.0.0.1:2300,127.0.0.1:2311"
WK_HOSTS="127.0.0.1:2200,127.0.0.1:2222"

TRACK_DIR="/tmp/graphlearn/"
rm -rf ${TRACK_DIR}
mkdir -p ${TRACK_DIR}

python dist_train.py \
  --tracker=${TRACK_DIR} \
  --ps_hosts=${PS_HOSTS} \
  --worker_hosts=${WK_HOSTS} \
  --job_name=ps \
  --task_index=0 &

sleep 2

python dist_train.py \
  --tracker=${TRACK_DIR} \
  --ps_hosts=${PS_HOSTS} \
  --worker_hosts=${WK_HOSTS} \
  --job_name=worker \
  --task_index=0 &

sleep 2

python dist_train.py \
  --tracker=${TRACK_DIR} \
  --ps_hosts=${PS_HOSTS} \
  --worker_hosts=${WK_HOSTS} \
  --job_name=ps \
  --task_index=1 &

sleep 2

python dist_train.py \
  --tracker=${TRACK_DIR} \
  --ps_hosts=${PS_HOSTS} \
  --worker_hosts=${WK_HOSTS} \
  --job_name=worker \
  --task_index=1 &

wait

Stdout&Stderr
stdout&stderr.txt

Server-Logs:
graphlearn.VM_10_224_centos.ced.log.WARNING.20200405-112725.21131.log
graphlearn.VM_10_224_centos.ced.log.WARNING.20200405-112721.21023.log
graphlearn.VM_10_224_centos.ced.log.INFO.20200405-112721.21023.log
graphlearn.VM_10_224_centos.ced.log.INFO.20200405-112725.21131.log

GraphSage dist_train.py training problem.

Hi, when I launch distributed training for GraphSAGE and check the output log, I find the following ERROR output after every iteration:

Epoch 38, Iteration 0, Time(s) 0.0830, Loss 0.86335
Epoch 38, Iteration 1, Time(s) 0.0877, Loss 0.59674
Epoch 38, Iteration 2, Time(s) 0.0899, Loss 0.54290
Epoch 38, Iteration 3, Time(s) 0.0685, Loss 0.71597
Epoch 38, Iteration 4, Time(s) 0.0743, Loss 0.84707
Epoch 38, Iteration 5, Time(s) 0.0781, Loss 0.49838
Epoch 38, Iteration 6, Time(s) 0.0681, Loss 0.77587
E0717 17:30:22.467953   589 notification.cc:194] RpcNotification:Failed	req_type:GetNodes	status:Out of range:No more nodes exist.
E0717 17:30:22.468039   589 distribute_runner.h:125] Rpc failed:Out of range:No more nodes exist.name:GetNodes

Currently, I am running the two (server + client) on the same physical machine with different ports.

Could you please help me to solve this problem?

Thanks!
