alibaba / graph-learn
An Industrial Graph Neural Network Framework
License: Apache License 2.0
I saw the sentence: "Otherwise, please refer to the section 'build from source'." But I have never managed to install successfully. My environment is Python 3 on CentOS (Alibaba Finance Cloud). Running the commands below keeps failing, and even after downloading the master package from the web page and copying it over, the git and make steps still fail.
git clone https://github.com/alibaba/graph-learn.git
cd graph-learn
git submodule update --init
make test
make python
What are the roles of the client and the server?
About deploy_mode: what does mode 2 mean relative to modes 0 and 1?
Thank you for the answer!
CentOS 7
python 2.7.5
tensorflow 1.12.0
I have built GL from source successfully and passed ./test_cpp_ut.sh. But when I run ./test_python_ut.sh, I get stuck on import errors.
ImportError: cannot import name abs

./graphlearn/python/tests/test_node_weighted.py
Traceback (most recent call last):
  File "./graphlearn/python/tests/test_node_weighted.py", line 23, in <module>
    import graphlearn as gl
  File "/usr/lib64/python2.7/site-packages/graphlearn/__init__.py", line 33, in <module>
    from graphlearn.python.model.tf import aggregators
  File "/usr/lib64/python2.7/site-packages/graphlearn/python/model/tf/aggregators/__init__.py", line 20, in <module>
    from graphlearn.python.model.tf.aggregators.gcn_aggregator import GCNAggregator
  File "/usr/lib64/python2.7/site-packages/graphlearn/python/model/tf/aggregators/gcn_aggregator.py", line 20, in <module>
    import tensorflow as tf
  File "/usr/lib/python2.7/site-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/usr/lib/python2.7/site-packages/tensorflow/python/__init__.py", line 88, in <module>
    from tensorflow.python import keras
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/__init__.py", line 24, in <module>
    from tensorflow.python.keras import activations
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/activations/__init__.py", line 22, in <module>
    from tensorflow.python.keras._impl.keras.activations import elu
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/_impl/keras/__init__.py", line 21, in <module>
    from tensorflow.python.keras._impl.keras import activations
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/_impl/keras/activations.py", line 23, in <module>
    from tensorflow.python.keras._impl.keras import backend as K
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/_impl/keras/backend.py", line 38, in <module>
    from tensorflow.python.layers import base as tf_base_layers
  File "/usr/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 25, in <module>
    from tensorflow.python.keras.engine import base_layer
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/engine/__init__.py", line 23, in <module>
    from tensorflow.python.keras.engine.base_layer import InputSpec
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 35, in <module>
    from tensorflow.python.keras import backend
  File "/usr/lib/python2.7/site-packages/tensorflow/python/keras/backend/__init__.py", line 22, in <module>
    from tensorflow.python.keras._impl.keras.backend import abs
I believe the error comes from an inconsistency between TF and its dependencies, judging by the related threads. I have tried many combinations of package versions but still failed.
Could you provide the detailed versions of TF's dependencies?
Sigmoid cross-entropy loss is used for the unsupervised model, and KL loss is used for LINE.
But according to the formula cross_entropy - entropy = KL, the entropy is fixed over the entire training set, so using cross-entropy is the same as using KL.
(Of course, for each specific batch, the entropy term changes.)
I don't quite understand why LINE uses KL loss. Can you help?
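For reference, the identity being invoked, written out in my notation (with p the data distribution and q_theta the model):

D_{\mathrm{KL}}(p \,\|\, q_\theta) = H(p, q_\theta) - H(p)

Since H(p) does not depend on theta, minimizing the cross entropy H(p, q_theta) and minimizing the KL divergence give the same gradients over the full training set.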
@archwalker @baoleai
Thanks
I am trying to install graph-learn from source. When I run make python, I get the following error. My Python version is 3.7.4.
python /home/kanon/code/graph-learn/setup/setup.py bdist_wheel
/home/kanon/anaconda3/lib/python3.7/site-packages/setuptools/dist.py:462: UserWarning: The version specified (b'0.1') is an invalid version, this may not work as expected with newer versions of setuptools, pip, and PyPI. Please see PEP 440 for more details.
"details." % self.metadata.version
running bdist_wheel
Traceback (most recent call last):
  File "/home/kanon/code/graph-learn/setup/setup.py", line 85, in <module>
    package_data={'': ['python/lib/lib*.so*']},
  File "/home/kanon/anaconda3/lib/python3.7/site-packages/setuptools/__init__.py", line 144, in setup
    return distutils.core.setup(**attrs)
  File "/home/kanon/anaconda3/lib/python3.7/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/home/kanon/anaconda3/lib/python3.7/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/home/kanon/anaconda3/lib/python3.7/distutils/dist.py", line 984, in run_command
    cmd_obj.ensure_finalized()
  File "/home/kanon/anaconda3/lib/python3.7/distutils/cmd.py", line 107, in ensure_finalized
    self.finalize_options()
  File "/home/kanon/anaconda3/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 129, in finalize_options
    self.data_dir = self.wheel_dist_name + '.data'
  File "/home/kanon/anaconda3/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 164, in wheel_dist_name
    safer_version(self.distribution.get_version()))
  File "/home/kanon/anaconda3/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 43, in safer_version
    return safe_version(version).replace('-', '_')
  File "/home/kanon/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1333, in safe_version
    return str(packaging.version.Version(version))
  File "/home/kanon/anaconda3/lib/python3.7/site-packages/pkg_resources/_vendor/packaging/version.py", line 200, in __init__
    match = self._regex.search(version)
TypeError: cannot use a string pattern on a bytes-like object
Makefile:317: recipe for target 'python' failed
make: *** [python] Error 1
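For what it's worth, the TypeError above matches what happens when the version reaches setuptools as bytes, which is what the earlier b'0.1' warning suggests. A minimal sketch of the failure and the obvious decode fix, assuming that is indeed the cause:

from pkg_resources import safe_version

# Reproduction sketch of the TypeError above: packaging's version regex is a
# str pattern, so bytes input fails exactly as in the traceback.
raw_version = b"0.1"  # what the warning says setup.py passed
version = raw_version.decode() if isinstance(raw_version, bytes) else raw_version
print(safe_version(version))  # prints '0.1'; passing the raw bytes raises TypeError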
OS: CentOS 7
python: 3.6
I have encountered several errors when trying to build GL from source. It seems to be a Python-version issue.
Hey guys, thank you for open-sourcing graph-learn. However, I ran the demo in bipartite-graphsage, and the result is awful (rather low recall and precision). Can you give me some hints?
I just followed the instructions on this page:
bipartite_graphsage
Where is the 'train_unsupervised.py' code file?
I set the config below but get the error AssertionError: ego_spec num must be the same with hops num!
'hops_num': 3,
'u_neighs_num': [5,3,2],
'i_neighs_num': [5,3,2],
Great work!
I wonder how graph-learn does synchronous training.
It would be great if there were a distributed synchronous-training example.
When I run the graphsage dist_train.py (Cora data), the worker memory usage keeps increasing.
When I train the model with our own data, which is a larger graph, the memory usage grows even faster.
I suspect there may be a memory leak: maybe some objects from previous iterations are not freed? Any advice or suggestions would be greatly appreciated.
docker image: registry.cn-zhangjiakou.aliyuncs.com/pai-image/graph-learn:v0.1-cpu
code path: /workspace/graph-learn/examples/tf/graphsage (in docker container)
config: 2ps, 2worker / batchsize: 32 / epoch: 40000000
Does graph-learn prepare all batch sampling results before training?
For example, if I set up 1000 training iterations, is graph-learn's strategy to prepare the data for all 1000 batches first and then use the next method to fetch one batch at a time? (A generic sketch of the two patterns follows below.)
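For context, here is the access pattern I have in mind, in generic Python (not graph-learn's actual API): lazy batches pulled one at a time via next, versus all batches materialized up front.

# Generic illustration of the two strategies in question; not graph-learn code.
def lazy_batches(ids, batch_size):
    for i in range(0, len(ids), batch_size):
        yield ids[i:i + batch_size]  # each batch is produced only on demand

ids = list(range(10))
it = lazy_batches(ids, 4)
print(next(it))  # [0, 1, 2, 3] -- nothing beyond this batch has been prepared

eager = list(lazy_batches(ids, 4))  # the "prepare everything first" strategy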
In distributed mode, I first need to randomly select batch_size IDs as source vertices, and then sample further from these source vertices.
Are these source vertices selected from the subgraph on each machine or from the whole graph?
If they are selected only from the local subgraph, communication is reduced, but the model quality may also be reduced.
Thanks!
Could you give us a few examples of classical graph algorithms, such as PageRank, LPA, WCC, etc.?
Hi,
Thanks for developing this open-source project. I noticed that Alibaba also open-sourced Euler; can anyone point out the differences between the two?
Hi, I am exploring your DeepWalk implementation, but I am a little confused by the gen_pair function used in _positive_sample.
Specifically, why does gl.gen_pair slide the window across paths rather than within each path?
>>> path = [np.array([1, 2]), np.array([3, 4]), np.array([5, 6])]
>>> left_window_size = right_window_size = 1
>>> src_ids, dst_ids = gen_pair(path, left_window_size, right_window_size)
>>> print(src_ids, dst_ids)
(array([1, 2, 3, 4, 3, 4, 5, 6]), array([3, 4, 1, 2, 5, 6, 3, 4]))
The example above starts from 3 nodes {1, 3, 5}, each with a random walk of length 2, which finally collects 3 paths: p1 = [1, 2], p2 = [3, 4], and p3 = [5, 6].
I thought the original DeepWalk paper applies SkipGram to each of these paths, so we should apply SkipGram to p1, p2, and p3 separately. But the implementation seems to apply SkipGram to the whole set of paths; for example, it pushes all nodes in the path into pair[0] or pair[1].
Why can we do that instead of following the original algorithm in the paper? Or am I misunderstanding something?
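For contrast, here is what I understand per-path SkipGram pairing (the paper's formulation, as I read it) would produce on the same input; this is my own sketch, not graph-learn code:

import numpy as np

# My expectation sketched out: slide the window inside each path only,
# never across paths. Window size is 1 on both sides.
paths = [np.array([1, 2]), np.array([3, 4]), np.array([5, 6])]
src_ids, dst_ids = [], []
for path in paths:
    for i, src in enumerate(path):
        for j in range(max(0, i - 1), min(len(path), i + 2)):
            if j != i:
                src_ids.append(src)
                dst_ids.append(path[j])
print(src_ids, dst_ids)  # [1, 2, 3, 4, 5, 6] [2, 1, 4, 3, 6, 5]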
Thanks.
Do you have any references on Bipartite GraphSage? What are the differences between Bipartite GraphSage and classical GraphSage?
Hi,
I ran dist_train.py (examples/tf/graphsage/dist_train.py) and it works well. However, when I try to save embeddings after training, it raises RuntimeError("Graph is finalized and cannot be modified."). I hit the same issue when I run Bipartite GraphSAGE in distributed mode.
Traceback (most recent call last):
  File "dist_train.py", line 132, in <module>
    main()
  File "dist_train.py", line 128, in main
    train(config, g)
  File "dist_train.py", line 81, in train
    u_embs = trainer.get_node_embedding("u")
  File "/usr/local/lib/python2.7/dist-packages/graphlearn/python/model/tf/trainer.py", line 57, in get_node_embedding
    ids, emb, iterator = self.model.node_embedding(node_type)
In the Distribute runtime design doc, it shows the client sending an op request to only 1 server, and that server sending out partitioned requests.
However, the Distributed mode doc shows the client connecting to multiple servers.
I'm a bit confused about the client-server mode: does the client send op requests to multiple servers?
I checked grpc_client.cc, and it seems the client connects to only 1 server. If so, this picture is not correct.
I can install from source successfully by removing -mavx from CXXFLAGS. Could you add an option to the Makefile to auto-detect AVX support?
When setting hops_num: 1 and neighs_num: [25] in examples/tf/graphsage/train_supervised.py, it raises an AssertionError:
assert self._depth + 1 == len(feature_encoders)
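A plausible reading of that assertion, inferred from the assert text alone rather than from the docs: the model builds one feature encoder for the ego nodes plus one per hop, so feature_encoders needs hops_num + 1 entries.

# Toy check of the invariant the assert appears to encode (my inference, with
# hypothetical encoder names): hops_num = 1 requires 2 feature encoders.
hops_num = 1
feature_encoders = ["ego_encoder", "hop1_encoder"]
assert hops_num + 1 == len(feature_encoders)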
In the dynamic_worker_threadpool.cc file, I think WaitForNotify() is used in several scenarios:
1. Between the push and pop of idle_threads_stack: AddTask() first pops the stack, which makes pinfo != info, and Set() is called before Wait(), so the thread doesn't need to wait.
2. When pinfo == info, the thread automatically loops for tasks.
3. When pinfo != info, Set() wakes another thread to continue working (via the condition signal).
So I am confused by 2 questions:
1. Before Shutdown(), it seems at least one thread must be active at any time, because each thread is always woken via Set(), which sets is_set to true. So when can WaitForIdle() return?
2. Shutdown() sets stopped_ so that active threads break out of the loop and complete. Why does one last thread need to be kept waiting for the event_for_all_workers_exit_ signal? I think that when thread_num_ decreases to zero, the event_for_all_workers_exit_ Wait() could return directly.
I got an error when trying to install from the wheel file on CentOS 7. Do you have any plans to support CentOS?
ERROR: graphlearn-0.1-cp27-cp27mu-linux_x86_64.whl is not a supported wheel on this platform.
As titled.
I see that graph-learn has both C++ and Python versions of the aggregators (mean, sum, ...).
Why does graph-learn need the C++ version of the aggregators?
Hi,
when I test performance using the graph-learn framework and set inter_thread_num to 64 or greater via gl.set_inter_threadnum(64), all the servers hang while initializing graph data and the workers keep waiting for the servers to become ready.
Dear developers,
When I run the graphsage PPI training example, the script gets killed while saving embeddings:
Epoch 00, Iteration 110, Time(s) 2.8076, Loss 1.26855
Epoch 00, Iteration 111, Time(s) 0.6244, Loss 1.25834
save embedding...
(a few minutes later)
Killed
I have confirmed that ./id_emb exists at the right path and that there are 18 GB of space available on the disk.
Could you kindly let me know the possible reason and how to fix this issue?
Thanks very much!
Kevin
Hello AliGraph people!
I read the VLDB 2019 paper and I found it very interesting. It is great that the project is open source.
I am wondering if I can implement my own embedding algorithm on top of your system. For example, can I implement DeepWalk in AliGraph?
If yes, do you provide abstractions for defining random-walk operations in AliGraph?
If not, is it possible to build such abstractions in your system?
Thanks in advance.
Best,
Makis
If I change neighs_num to [25, 10] and batch_size to 100 in this file:
https://github.com/alibaba/graph-learn/blob/master/examples/tf/graphsage/dist_train.py
I see a ResourceExhaustedError, and it seems related to the graph-engine operations. I am not seeing any CPU or RAM bottleneck.
I run 2 parameter servers and 2 workers as shown in the wiki.
The error is:
23:56:40.288581 61070 notification.cc:194] RpcNotification:Failed req_type:LookupNodes status:Resource exhausted:Received message larger than max (8066392 vs. 4194304)
23:56:40.322116 61070 distribute_runner.h:125] Rpc failed:Resource exhausted:Received message larger than max (8066392 vs. 4194304)name:LookupNodes
Any idea how to fix this?
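For context, the 4194304 in the log is gRPC's default 4 MB maximum receive size. Whether graph-learn exposes a knob for this I don't know; in plain Python gRPC the limit would be raised with channel options like these (a sketch, not graph-learn API):

import grpc

# Sketch only: raising gRPC's default 4 MB (4194304-byte) message limits on a
# plain Python channel; the target address is a placeholder. How/whether
# graph-learn forwards such options to its internal channels is an assumption
# to verify with the maintainers.
options = [
    ("grpc.max_send_message_length", 64 * 1024 * 1024),
    ("grpc.max_receive_message_length", 64 * 1024 * 1024),
]
channel = grpc.insecure_channel("localhost:50051", options=options)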
Hello. I'm interested in how to use the METIS partitioning algorithm with GL. Is there any tutorial or doc? Thank you!
First of all, thank you guys for open-sourcing such an amazing project.
I tried to follow THIS manual to try out distributed training on a single machine, but failed to start the training process.
Here is my script to start the ps and worker processes.
PS_HOSTS="127.0.0.1:2300,127.0.0.1:2311"
WK_HOSTS="127.0.0.1:2200,127.0.0.1:2222"
python dist_train.py \
--tracker=./distributed \
--ps_hosts=${PS_HOSTS} \
--worker_hosts=${WK_HOSTS} \
--job_name=ps \
--task_index=0 &
python dist_train.py \
--tracker=./distributed \
--ps_hosts=${PS_HOSTS} \
--worker_hosts=${WK_HOSTS} \
--job_name=worker \
--task_index=0 &
python dist_train.py \
--tracker=./distributed \
--ps_hosts=${PS_HOSTS} \
--worker_hosts=${WK_HOSTS} \
--job_name=ps \
--task_index=1 &
python dist_train.py \
--tracker=./distributed \
--ps_hosts=${PS_HOSTS} \
--worker_hosts=${WK_HOSTS} \
--job_name=worker \
--task_index=1 &
wait
I also added some logging to the Graph.init() function (https://github.com/alibaba/graph-learn/blob/master/graphlearn/python/graph.py), but I never see "############# Server init done #############" printed.
if job_name == "client":
  pywrap.set_client_id(task_index)
  self._client = pywrap.rpc_client()
  self._server = None
else:
  print("############# Server init start #############")
  if job_name == "server":
    self._client = None
  if not tracker and kwargs.get("tracker"):
    tracker = kwargs["tracker"]
  if tracker:
    self._server = Server(task_index, server_count, tracker)
  else:
    self._server = Server(task_index, server_count)
  self._server.start()
  print("############# Server start done #############")
  self._server.init(self._edge_sources, self._node_sources)
  print("############# Server init done #############")
return self
Everything I could capture is listed below. It just keeps printing Invalid endpoint file: 0 until the end of the world.
main
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0402 13:10:49.755939 10816 naming_engine.cc:56] Connect naming engine ok: ./distributed/endpoints/
I0402 13:10:49.756223 10816 channel_manager.cc:94] Auto select server: 1
W0402 13:10:49.756240 10816 channel_manager.cc:100] Waiting for all servers started: 0/2
W0402 13:10:49.756494 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:49.756530 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:49.756541 10904 naming_engine.cc:159] Refresh endpoints count: 0
2020-04-02 13:10:49.771019: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-02 13:10:49.777325: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:2300, 1 -> 127.0.0.1:2311}
2020-04-02 13:10:49.777366: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:2200, 1 -> localhost:2222}
2020-04-02 13:10:49.784454: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:2222
main
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0402 13:10:49.878661 10814 naming_engine.cc:56] Connect naming engine ok: ./distributed/endpoints/
I0402 13:10:49.878902 10814 channel_manager.cc:94] Auto select server: 0
W0402 13:10:49.878921 10814 channel_manager.cc:100] Waiting for all servers started: 0/2
W0402 13:10:49.880380 10951 naming_engine.cc:154] Invalid endpoint file: 0
main
W0402 13:10:49.880429 10951 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:49.880441 10951 naming_engine.cc:159] Refresh endpoints count: 0
############# Server init start #############
2020-04-02 13:10:49.894944: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-02 13:10:49.900562: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:2300, 1 -> 127.0.0.1:2311}
2020-04-02 13:10:49.900591: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2200, 1 -> 127.0.0.1:2222}
2020-04-02 13:10:49.901519: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:2200
main
############# Server init start #############
W0402 13:10:50.756636 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:50.756687 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:50.756696 10904 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:50.880582 10951 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:50.880635 10951 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:50.880697 10951 naming_engine.cc:159] Refresh endpoints count: 0
[2020-04-02 13:10:50.888773] Server started.
############# Server start done #############
[2020-04-02 13:10:50.985136] Server started.
############# Server start done #############
W0402 13:10:51.756803 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:51.756860 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:51.756868 10904 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:51.880851 10951 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:51.880900 10951 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:51.880908 10951 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:52.756978 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:52.757043 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:52.757053 10904 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:52.881058 10951 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:52.881108 10951 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:52.881115 10951 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:53.757174 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:53.757233 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:53.757244 10904 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:53.881242 10951 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:53.881289 10951 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:53.881297 10951 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:54.757366 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:54.757421 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:54.757429 10904 naming_engine.cc:159] Refresh endpoints count: 0
Any clue? Thank you!
Say a node has multiple aspects of textual description; one way is to store them as multiple attributes of the node, separated by a delimiter such as a colon. E.g.
id:int64 attribute:string
10001 the color is blue:round shape:it's very nice and expensive
However, if the text itself contains a colon, the split breaks. What's the best way to feed multiple text attributes into graph-learn? Separating them by "\t" within a line breaks the code. What about putting the text attributes into multiple node files (one attribute per file)? Would that be supported?
I understand the text needs to be further encoded by custom encoders, which I plan to implement.
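One workaround I would try, assuming gl.Decoder's attr_delimiter parameter behaves as its name suggests (an assumption to verify against the data-source docs): keep all texts in one attribute column, but switch the delimiter to a character that cannot occur in the text.

import graphlearn as gl

# Sketch under the assumption that gl.Decoder accepts attr_delimiter: three
# string attributes separated by a rare control character, so colons inside
# the text no longer break the split.
decoder = gl.Decoder(attr_types=["string", "string", "string"],
                     attr_delimiter="\x01")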
Would you like to wrap any pointer data members with the class template “std::unique_ptr”?
Building the Python wheel from the master branch and running the GCN model gives:
python train_supervised.py
Traceback (most recent call last):
  File "train_supervised.py", line 22, in <module>
    import graphlearn as gl
  File "/home/cn/research/graph/venv/lib/python2.7/site-packages/graphlearn/__init__.py", line 16, in <module>
    from graphlearn import pywrap_graphlearn as pywrap
ImportError: /home/cn/research/graph/graph-learn/built/lib/libgraphlearn_shared.so: undefined symbol: _ZN6google14FlagRegistererC1IiEEPKcS3_S3_PT_S5_
It seems that the symbol _ZN6google14FlagRegistererC1IiEEPKcS3_S3_PT_S5_ (which is the mangled FlagRegisterer from the gflags package) was not found.
So I tried to modify the Makefile and link the gflags library manually, and the problem was solved (modified lines: here):
so: protobuf grpc glog gtest proto common platform service core
	@mkdir -p $(INCLUDE_DIR)
	@mkdir -p $(LIB_DIR)
	@mkdir -p $(BIN_DIR)
	$(CXX) $(CXXFLAGS) -shared $(PROTO_OBJ) $(COMMON_OBJ) $(PLATFORM_OBJ) $(SERVICE_OBJ) $(CORE_OBJ) \
	  -L$(ROOT) -L$(GLOG_LIB) -L$(PROTOBUF_LIB) -L$(GRPC_LIB) -L$(GFLAGS_LIB) \
	  -lglog -lprotobuf -lgrpc++ -lgrpc -lgpr -lupb -lgflags \
	  -o $(LIB_DIR)/libgraphlearn_shared.so
Is this a bug, or is there some special way to link gflags that isn't mentioned in the docs so far?
In the current implementation, the graph client and server are co-located with the TF worker and the TF parameter server.
When I want to use one TF worker to train and multiple workers to sample data simultaneously (for GPU training), there are restrictions under the current architecture. So, is there any plan to decouple the TF PS and the distributed graph engine to make the architecture more flexible?
CXX google/protobuf/text_format.lo
google/protobuf/text_format.cc: In member function ‘virtual void google::protobuf::TextFormat::FastFieldValuePrinter::PrintFloat(float, google::protobuf::TextFormat::BaseTextGenerator*) const’:
google/protobuf/text_format.cc:1623:27: error: ‘__builtin_isnan’ is not a member of ‘std’
generator->PrintString(!std::isnan(val) ? SimpleFtoa(val) : "nan");
^
google/protobuf/text_format.cc:1623:27: note: suggested alternative:
: note: ‘__builtin_isnan’
google/protobuf/text_format.cc: In member function ‘virtual void google::protobuf::TextFormat::FastFieldValuePrinter::PrintDouble(double, google::protobuf::TextFormat::BaseTextGenerator*) const’:
google/protobuf/text_format.cc:1627:27: error: ‘__builtin_isnan’ is not a member of ‘std’
generator->PrintString(!std::isnan(val) ? SimpleDtoa(val) : "nan");
^
google/protobuf/text_format.cc:1627:27: note: suggested alternative:
: note: ‘__builtin_isnan’
Makefile:4019: recipe for target 'google/protobuf/text_format.lo' failed
make[1]: *** [google/protobuf/text_format.lo] Error 1
make[1]: Leaving directory '/home/xxxx/Repo/graph-learn/third_party/protobuf/protobuf/src'
Makefile:1723: recipe for target 'install-recursive' failed
make: *** [install-recursive] Error 1
I don't see any demo of GCN or other graph algorithms trained on HDFS data that then pulls the node embeddings to local disk or HDFS for use. Would you add some? Thank you!
As we know from the paper, graph-learn has implemented four built-in graph partition algorithms to minimize the number of crossing edges whose endpoints lie on different workers, but some crossing edges may still exist after graph partitioning.
For a graph like the following (3-hop):
I want to know how graph-learn deals with this boundary problem, or whether it provides optimization methods to avoid this situation.
Dear developers,
I want to analyze the memory-access profile of graph-learn as a study for optimizing its performance.
Could you let me know how to dump the memory trace?
Thanks very much!
Kevin
In my project, my IDs are bank-card numbers 21 digits long, which overflow int64. Is there any way to solve this problem? Thanks!
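A workaround sketch, not a graph-learn feature: since a 21-digit number exceeds the int64 range, map each card string to a stable 8-byte hash and keep a reverse table if the original IDs must be recovered later. Collisions are possible but unlikely for moderate graph sizes.

import hashlib

# Map a long string ID to a deterministic int64 via the first 8 bytes of MD5.
def card_to_int64(card_id: str) -> int:
    digest = hashlib.md5(card_id.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

print(card_to_int64("123456789012345678901"))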
I parsed my dataset into formatted files like this:
node:
id:int64 attribute:string
3916 c
1819 c
4501 c
edge:
src_id:64 dst_id:int64 attribute:string
4 6 p
4 7 p
4 8 p
4 9 j
then I run commands:
g = g.node(source="data/graph_data/node0.txt", node_type="entry", decoder=gl.Decoder(attr_types=["string"]))
g = g.edge(source="data/graph_data/edge0.txt", edge_type = ("entry", "entry", "action"), decoder=gl.Decoder(attr_types=["string"]))
g = g.init()
When I tried to get a node's attributes, I found they are empty:
In [11]: res.__dict__
Out[11]:
{'_attred': False,
'_float_attrs': None,
'_graph': <graphlearn.python.graph.Graph at 0x7fc178e71d90>,
'_ids': array([ 4, 3916]),
'_int_attrs': None,
'_labels': None,
'_shape': (2,),
'_string_attrs': None,
'_type': 'entry',
'_weights': None}
Did I get any step wrong?
Hi there,
Recently, when I used dist_train.py (examples/tf/graphsage/dist_train.py) to test distributed mode, I found that the number of iterations differs considerably between workers (in my opinion, it should be roughly equal for each worker).
The PPI dataset contains 56,944 nodes and 818,717 edges. Suppose I set the batch size to 100; then there should be 570 iterations (node-based sampler) in a one-epoch training schedule.
When I use a 1-ps, 2-worker configuration for distributed training, worker-0 runs 284 iterations but worker-1 runs 856, and the total across the two workers is 1140 (2 × 570). The data appears to have been traversed twice in one epoch.
When I use a 2-ps, 1-worker configuration (an unusual setting, just for the experiment), worker-0 runs only 285 iterations (570 / 2). Similarly, with 4 ps and 2 workers, worker-0 runs 143 iterations and worker-1 runs 143. Half of the data is not used.
After reading some of the source code, I guess problem 1 may be caused by the shared state in node_getter.cc (graphlearn/core/operator/graph/node_getter.cc).
When a client sends a node-getter request, the NodeGetter op locks the DataStorage, so multiple requests will not read the same data (thread safety).
But when the cursor reaches the end of the data, the server raises an OutOfRangeError to that client and resets the cursor to 0. Other workers connected to the same server never receive the OutOfRangeError signal, so by the time they issue their next get-nodes request, the shared cursor has already been re-initialized to 0.
So, as in problem 1's result: when worker-0 reaches the end of the data, it receives an OutOfRangeError and the server resets the cursor to 0. Worker-0 then finishes its training process, while worker-1 restarts from 0 and traverses the whole dataset again (iterations 0 to 569).
When the number of workers is greater than the number of servers (ps), each server resets the shared state multiple times (once per connected worker), whereas the number of resets should really depend only on the epoch setting. This makes the actual number of iterations greater than expected.
When the number of workers is smaller than the number of servers (ps), the nodes are partitioned into n_server parts (by hash, for now), but only n_worker of those parts are used in training. Could this be because each client connects to only one fixed server?
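A toy model of the shared-cursor behavior described above (a paraphrase of my reading of node_getter.cc, not its real code), showing why only the worker whose request hits the end of the data ever sees the out-of-range signal:

# Toy reproduction of the shared-cursor effect; both workers draw batches
# from one server-side cursor, and the reset is visible to neither afterward.
class SharedNodeGetter:
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.cursor = 0  # shared by every worker connected to this server

    def get_nodes(self, batch_size):
        if self.cursor >= self.num_nodes:
            self.cursor = 0                    # reset happens server-side...
            raise LookupError("OutOfRange")    # ...but only this caller sees it
        start = self.cursor
        self.cursor = min(self.cursor + batch_size, self.num_nodes)
        return list(range(start, self.cursor))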
pos_logit uses sim_function; why does neg_logit always use tf.multiply?
pos_logit = sim_function(src_emb, pos_emb)
src_emb_exp = tf.tile(tf.expand_dims(src_emb, axis=1),
                      [1, per_sample_neg_num, 1])
src_emb_exp = tf.reshape(src_emb_exp, [-1, emb_dim])
neg_logit = tf.reduce_sum(tf.multiply(src_emb_exp, neg_emb), axis=-1)
true_xent = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=tf.ones_like(pos_logit), logits=pos_logit)
negative_xent = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=tf.zeros_like(neg_logit), logits=neg_logit)
loss = tf.reduce_mean(true_xent) + 1.0 * tf.reduce_mean(negative_xent)
logit = tf.concat([pos_logit, neg_logit], axis=-1)
label = tf.concat([tf.ones_like(pos_logit, dtype=tf.int32),
                   tf.zeros_like(neg_logit, dtype=tf.int32)], axis=-1)
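If sim_function is an inner product (my assumption here; it may be configurable), the two branches compute the same quantity and differ only in code shape:

import tensorflow as tf

# Equivalence check, assuming sim_function is a dot product:
# reduce_sum(multiply(a, b), axis=-1) is exactly the batched inner product.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])
dot = tf.reduce_sum(tf.multiply(a, b), axis=-1)  # values [17.0, 53.0]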
An example
g.V("user").with_attribute("gender=man").batch(64).outV("buy").sample(2).with_attribute("city=1").by("random")
If this is not resolved, the GPU will not be fully utilized in some situations.
I would appreciate better clarification of these issues; thanks a lot.
Hi, when I try to launch distributed training for GraphSage and check the output log, I find the following ERROR output after every iteration:
Epoch 38, Iteration 0, Time(s) 0.0830, Loss 0.86335
Epoch 38, Iteration 1, Time(s) 0.0877, Loss 0.59674
Epoch 38, Iteration 2, Time(s) 0.0899, Loss 0.54290
Epoch 38, Iteration 3, Time(s) 0.0685, Loss 0.71597
Epoch 38, Iteration 4, Time(s) 0.0743, Loss 0.84707
Epoch 38, Iteration 5, Time(s) 0.0781, Loss 0.49838
Epoch 38, Iteration 6, Time(s) 0.0681, Loss 0.77587
E0717 17:30:22.467953 589 notification.cc:194] RpcNotification:Failed req_type:GetNodes status:Out of range:No more nodes exist.
E0717 17:30:22.468039 589 distribute_runner.h:125] Rpc failed:Out of range:No more nodes exist.name:GetNodes
Currently, I am running the two (server + client) pairs on the same physical machine with different ports.
Could you please help me solve this problem?
Thanks!
Hi! First of all, thanks for releasing graph-learn!
Regarding the bipartite version of GraphSage, I am aware that you use the u2i.zip dataset, and I have successfully run the model on that dummy dataset without issues.
I do believe, though, that the u2i dataset does not include any distinct node features whatsoever, much less feature vectors of different lengths (depending on the node type, e.g. users or items).
Have you tested the model on a standard recommendation dataset, like MovieLens? If yes, does it work out of the box? I haven't really gotten to try it on that dataset myself; I'm just checking whether the model supports fully fledged node features, especially of different lengths.
Thanks in advance!
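For what it's worth, my understanding (an assumption to confirm with the maintainers) is that each node source gets its own Decoder, so users and items can declare feature vectors of different lengths:

import graphlearn as gl

# Sketch assuming per-source Decoders as in g.node(...); the file paths and
# feature counts here are hypothetical placeholders.
user_decoder = gl.Decoder(attr_types=["float"] * 8)    # e.g. 8 user features
item_decoder = gl.Decoder(attr_types=["float"] * 16)   # e.g. 16 item features
g = gl.Graph() \
    .node("data/user.txt", node_type="u", decoder=user_decoder) \
    .node("data/item.txt", node_type="i", decoder=item_decoder)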
After patching the fixes (#4, #11), I rebuilt/reinstalled graph-learn and used the commands below to start dist_train.py, but the problem persists.
PS_HOSTS="127.0.0.1:2300,127.0.0.1:2311"
WK_HOSTS="127.0.0.1:2200,127.0.0.1:2222"
TRACK_DIR="/tmp/graphlearn/"
rm -rf ${TRACK_DIR}
mkdir -p ${TRACK_DIR}
python dist_train.py \
--tracker=${TRACK_DIR} \
--ps_hosts=${PS_HOSTS} \
--worker_hosts=${WK_HOSTS} \
--job_name=ps \
--task_index=0 &
sleep 2
python dist_train.py \
--tracker=${TRACK_DIR} \
--ps_hosts=${PS_HOSTS} \
--worker_hosts=${WK_HOSTS} \
--job_name=worker \
--task_index=0 &
sleep 2
python dist_train.py \
--tracker=${TRACK_DIR} \
--ps_hosts=${PS_HOSTS} \
--worker_hosts=${WK_HOSTS} \
--job_name=ps \
--task_index=1 &
sleep 2
python dist_train.py \
--tracker=${TRACK_DIR} \
--ps_hosts=${PS_HOSTS} \
--worker_hosts=${WK_HOSTS} \
--job_name=worker \
--task_index=1 &
wait
Stdout & stderr:
stdout&stderr.txt
Server-Logs:
graphlearn.VM_10_224_centos.ced.log.WARNING.20200405-112725.21131.log
graphlearn.VM_10_224_centos.ced.log.WARNING.20200405-112721.21023.log
graphlearn.VM_10_224_centos.ced.log.INFO.20200405-112721.21023.log
graphlearn.VM_10_224_centos.ced.log.INFO.20200405-112725.21131.log
https://github.com/alibaba/graph-learn/blob/master/docs/query.md#inneg
inNeg(edge_type). Nodes to Nodes. Traverse to the source negative node along the edge. The edge must be undirected.
For example, if the topology is nodeA--(edge)-->nodeB, the doc says nodeB.outNeg(edge) is nodeA.
Should it be "nodeB.inNeg(edge) is nodeA"?