A Simple High Performance Computing Framework for [Federated] Machine Learning
You can check the deploy document here:
Special thanks to:
License: Apache License 2.0
The current PR template does not contain a space before #.
Adding a space so that an issue can be referenced without typing an extra one.
Currently, LMDB is the only supported storage engine. A disk-based engine is required for large-scale data computing.
a = eggroll.table('name', 'ns')
b = a.get('key')
b is None
Require a session mechanism for the following reasons:
Describe the bug
Roll reports 'connection refused' when launching a new job
What version of Eggroll and what programming language (including its version) are you using?
python.Eggroll 0.3
**What is the severity of this issue and why?**
L1 - System totally unavailable;
the training job got stuck and rebooting does not resolve it.
How to reproduce this issue?
Steps to reproduce the behavior:
1. Launch a new training job in FATE (it gets stuck)
2. Go to roll/logs/fate-roll.log
3. See 'connection refused'
What did you expect to see?
No error should occur.
What did you see instead?
'connection refused'
Could you offer us the error logs or error screenshot?
What is your environment information (please complete the following information)?
Data import for 1 billion rows of data.
Need to improve performance too.
To be implemented in v1.x first, as there is a requirement from the FATE side, and ported to 2.x later.
Labeled as v1.x.
The POCs of RollPair and RollFrame are completed. Each has its own data structure and scheduling framework (though they are very similar).
A code merge of these 2 modules is required.
Framework POC, including cluster / node manager, storage format, data transfer, etc.
This work is mainly based on the RollFrame POC and the core lib migration.
The RollPair / RollTensor POC will start soon.
DTable objects support GC
run_cleanup_task(func)
Parameter func must be a function, not None.
Some sensitive information needs to be removed.
When creating a new processor on a heavily loaded machine, roll might fail to connect to the processor, showing 'connection refused'.
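One possible mitigation is to retry the connection with exponential backoff while the new processor is still starting. The sketch below is only an illustration under assumptions: connect_with_retry and the connect callable are hypothetical names, not Eggroll APIs, and the failure is assumed to surface as ConnectionRefusedError.

```python
import time


def connect_with_retry(connect, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call `connect()` until it succeeds or the attempts are exhausted.

    `connect` is a hypothetical zero-argument callable that raises
    ConnectionRefusedError while the processor is not yet listening.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionRefusedError:
            if attempt == max_attempts - 1:
                raise  # give up and surface the original error
            # Back off exponentially: base_delay, 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
```

The sleep parameter is injectable so that the backoff schedule can be checked without actually waiting.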
To ease debugging, a call sequence number needs to be added to each call.
This sequence number should be unique.
Suggest adding it to the gRPC call's metadata to avoid a proto change.
Labeled as v1.x.
In 2.x, consider whether it should be added to the proto file.
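A minimal sketch of how such a sequence number could be generated, assuming a per-process random prefix plus a thread-safe counter is unique enough. CallSequence and the metadata key "x-call-seq" are hypothetical names chosen for illustration.

```python
import itertools
import threading
import uuid


class CallSequence:
    """Generates call sequence ids intended to be unique across threads and processes."""

    def __init__(self):
        # Random prefix distinguishes this process from other processes.
        self._prefix = uuid.uuid4().hex
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def next_metadata(self):
        with self._lock:
            n = next(self._counter)
        # gRPC metadata is a sequence of (key, value) pairs with lowercase keys.
        # "x-call-seq" is a hypothetical key name.
        return (("x-call-seq", f"{self._prefix}-{n}"),)
```

In gRPC Python, a tuple like this can be passed through the metadata keyword argument of a stub call, so no proto change is needed.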
No exception is thrown when querying a table that does not exist in a directory. For example, LevelDB is not supported yet, but if we query a LevelDB table, no exception is thrown.
Change the version info in core/pom.xml from 2.9.9 to 2.9.9.1.
Currently put_all is single-threaded. This results in very low data input performance.
Advise implementing a parallel mechanism, e.g. a multi-threaded or multi-process put_all.
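A rough sketch of the multi-threaded variant, under assumptions: put_batch is a hypothetical thread-safe callable that writes one batch and returns the number of pairs written; it does not correspond to an existing Eggroll function.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items from an iterable."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def parallel_put_all(put_batch, kv_pairs, batch_size=1000, workers=4):
    """Write batches concurrently and return the total number of pairs written.

    `put_batch` is a hypothetical callable and must be safe to call from
    several threads. Note that Executor.map collects all batches up front,
    so very large inputs would need a bounded queue instead.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(put_batch, batches(kv_pairs, batch_size)))
```

Threads help when the write is I/O bound (network or disk); for CPU-bound serialization, a process pool may be the better fit.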
Describe the bug
A stream error occurs when launching a new job in FATE
What version of Eggroll and what programming language (including its version) are you using?
0.3,python
**What is the severity of this issue and why?**
L0: the training job got stuck and rebooting does not resolve it.
How to reproduce this issue?
Steps to reproduce the behavior:
What did you expect to see?
Should be all good with no errors
What did you see instead?
Stream error
Could you offer us the error logs or error screenshot?
If applicable, add logs or screenshots to help explain your problem.
What is your environment information (please complete the following information)?
Anything else we should know about your project / environment?
When session is null, no computing engine will be created.
Need to provide default computing engines when session is null.
Exception in thread roll_pair-send_command-90f32f70-dba0-11ea-88c4-fa163e1070a0-py-job-93d9f2a0-dba0-11ea-8514-fa163e1070a0_putAll:
Traceback (most recent call last):
File "/data/projects/fate/eggroll/python/eggroll/core/client.py", line 71, in sync_send
response = _command_stub.call(request.to_proto())
File "/data/projects/fate/common/python/venv/lib/python3.6/site-packages/grpc/_channel.py", line 565, in call
return _end_unary_response_blocking(state, call, False, None)
File "/data/projects/fate/common/python/venv/lib/python3.6/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Connection reset by peer"
debug_error_string = "{"created":"@1597129331.341114809","description":"Error received from peer ipv4:xx.xx.xx.xx:32882","file":"src/core/lib/surface/call.cc","file_line":1052,"grpc_message":"Connection reset by peer","grpc_status":14}"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/projects/fate/common/miniconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/data/projects/fate/common/miniconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/data/projects/fate/eggroll/python/eggroll/roll_pair/roll_pair.py", line 568, in send_command
serdes_type=SerdesTypes.PROTOBUF)
File "/data/projects/fate/eggroll/python/eggroll/core/client.py", line 54, in simple_sync_send
results = self.sync_send(inputs=[input], output_types=[output_type], endpoint=endpoint, command_uri=command_uri, serdes_type=serdes_type)
File "/data/projects/fate/eggroll/python/eggroll/core/client.py", line 84, in sync_send
raise CommandCallError(command_uri, endpoint, e)
eggroll.core.client.CommandCallError: ('Failed to call command: CommandURI(_uri=v1/roll-pair/runJob) to endpoint: xx.xx.xx.xx:32882, caused by: ', <_Rendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Connection reset by peer"
debug_error_string = "{"created":"@1597129331.341114809","description":"Error received from peer ipv4:xx.xxx.xx.xx:32882","file":"src/core/lib/surface/call.cc","file_line":1052,"grpc_message":"Connection reset by peer","grpc_status":14}"
)
Hello, please describe the role of RollFrame in Eggroll. Also, I did not find the package eggroll-roll-pair-2.0.1.jar in the packaged lib.
Describe the bug
In Eggroll 1.x, the return value of flatMap or mapPartition2 is a list, which depends heavily on memory. We hope that a later version of Eggroll can support returning a generator instead.
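The difference can be illustrated with plain Python, independent of Eggroll. Both functions below are hypothetical stand-ins, not the actual flatMap implementation: the first materializes every output pair before returning, the second yields pairs one at a time.

```python
def flat_map_list(partition, func):
    """Eager: builds the whole result in memory before returning."""
    result = []
    for k, v in partition:
        result.extend(func(k, v))
    return result


def flat_map_lazy(partition, func):
    """Lazy: yields one output pair at a time, so memory use stays bounded."""
    for k, v in partition:
        yield from func(k, v)
```

With the lazy form, a consumer that writes each pair to storage as it arrives never needs to hold the full result of a partition.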
FATE: 1.4.2
EGGROLL: 2.0.1
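Deployed with Docker-Compose from KubeFATE, starting 4 containers (rollsite, clustermanager, nodemanager, mysql) and forming a cluster across two hosts.
1. Pulling the images with docker pull *** and then deploying with Docker-Compose works fine.
2. However, if the Docker images are built offline and then deployed with Docker-Compose, the 6.2 roll_pair test
python -m unittest test_roll_pair.TestRollPairCluster (cluster mode)
fails. What could be going wrong?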
ERROR: setUpClass (test_roll_pair.TestRollPairCluster)
...
ValueError: processor in session meta is not valid:<ErSessionMeta(id=er_session_py_20200827.----_192.167.0.4, name=, status=ERROR, tag=, processors=[***, len=11], options=[{'eggroll.session.processors.per.node':'10'}]) at 0x...>
If users choose LEVEL_DB (actually RocksDB) as their storage engine, a destroy()
call will not delete the data file.
This seems to be a bug and needs a fix.
Add rocksdb and network support
The existing mechanism supports data import from CSV files and memory. However, databases and HDFS are also common data sources, and we need to support importing data directly from them.
Labeled as v1.x, but v2.x also needs this feature.
Migrating core library from 1.x to 2.x with the following changes:
The package-version in auto-packaging.sh is still 0.3
Update the path in services.sh
Such as
Eggroll 1.x supports data import from memory only. Users have to process their data themselves and then import it into Eggroll.
We should let users import data directly from a file.
Users can pass in their own split function, which returns a tuple of (key, value). The keys and values will be imported into Eggroll.
Implementing it in v1.x as there is a requirement from the FATE side. Porting it to v2.x later.
Labeled as v1.x.
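A minimal sketch of the proposed interface, under assumptions: import_from_lines, split_csv_first_column, and the put callable are hypothetical names for illustration, not the final Eggroll API.

```python
def import_from_lines(lines, split_func, put):
    """Apply `split_func` to each non-empty line and store the pair via `put`.

    `split_func` takes one line and returns a (key, value) tuple.
    `put` is a hypothetical callable taking (key, value).
    Returns the number of imported pairs.
    """
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, value = split_func(line)
        put(key, value)
        count += 1
    return count


def split_csv_first_column(line):
    """Example split function: first CSV column is the key, the rest is the value."""
    key, _, value = line.partition(",")
    return key, value
```

Because a file object iterates line by line, passing an open file as lines streams the data instead of loading it all into memory first.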
Add a log web server for every node
A frame-based roll object for computing, storage, and transfer
changes:
columnar frame format support
local threads first
concurrent computing in a partition
in-memory computing
Hello experts, could you provide an API that exposes the number of threads currently in use in the thread pool, or print it to the log?
Cannot find linking libraries on CentOS.