federatedai / eggroll


A Simple High Performance Computing Framework for [Federated] Machine Learning

License: Apache License 2.0

Python 44.47% Shell 1.24% Java 54.16% Batchfile 0.06% CSS 0.01% HTML 0.06%

eggroll's Introduction


Building and Deploying Eggroll

You can check the deploy document here:

Simplified Chinese (简体中文)


eggroll's People

Contributors

chanzhennan, chengtcc, dependabot[bot], drougon, dylan-fan, easson001, forgivedengkai, happycooperxu, hmoster, itboyljm, jakob-98, jarviszeng-zjc, liszekei, maxwong, mgqa34, petersansan, rexningdu, sagewe, xiaoshikun, xiaoshikun801, xiong-li-github, zzzcq


eggroll's Issues

Adding a space in PR template

The current PR template does not contain a space before #.
Adding one lets contributors reference an issue without typing the extra space themselves.

Add support for session mechanism

A session mechanism is required for the following reasons:

  1. Garbage collection of memory and state.
  2. Different runtime environments for different sessions.
  3. Cleanup tasks that run when a session completes.
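
The three requirements above can be sketched as a small session object. This is only an illustration under stated assumptions: the names Session, open_session, add_cleanup_task and the options key are hypothetical and are not Eggroll's actual API.

```python
import uuid
from contextlib import contextmanager


class Session:
    """Hypothetical session object; not Eggroll's actual API."""

    def __init__(self, options=None):
        self.session_id = f"session-{uuid.uuid4().hex}"
        self.options = dict(options or {})  # per-session runtime settings
        self.tables = {}                    # state owned by this session
        self._cleanup_tasks = []
        self.active = True

    def add_cleanup_task(self, func):
        """Register a callable to run when the session stops."""
        self._cleanup_tasks.append(func)

    def stop(self):
        if not self.active:
            return
        # Run cleanup tasks in reverse order of registration.
        for task in reversed(self._cleanup_tasks):
            task()
        # Drop everything the session owned so it can be garbage collected.
        self.tables.clear()
        self.active = False


@contextmanager
def open_session(options=None):
    """Yield a session and always stop it, even if the body raises."""
    session = Session(options)
    try:
        yield session
    finally:
        session.stop()
```

Tying state and cleanup to a context manager means a session is released even when a job fails midway, which is the situation the garbage collection requirement seems to target.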

roll reports 'connection refused' when launching a new job

Describe the bug
roll reports 'connection refused' when launching a new job.

What version of Eggroll and what programming language (including its version) are you using?
Python, Eggroll 0.3

What is the severity of this issue and why?
L1 - System totally unavailable;
the training job gets stuck, and rebooting does not resolve it.

How to reproduce this issue?
Steps to reproduce the behavior:

  1. Launch a new training job in FATE (it gets stuck).
  2. Go to roll/logs/fate-roll.log.
  3. See 'connection refused'.

What did you expect to see?
There should be no error.

What did you see instead?
'connection refused'

Could you offer us the error logs or error screenshot?
[screenshot]

What is your environment information (please complete the following information)?

  • OS: Windows 10

Support importing 1 billion rows of data

Support data import at the scale of 1 billion rows.
Import performance needs to improve as well.

To be implemented in v1.x first, since the requirement comes from the FATE side, then ported to 2.x later.
Labeled as v1.x.

Merging RollPair and RollFrame

The POCs of RollPair and RollFrame are complete. Each has its own data structure and scheduling framework, though the two are very similar.

The code of these two modules needs to be merged.

Frameworking POC

Framework POC, including cluster / node manager, storage format, data transfer, etc.
This work is mainly based on the RollFrame POC and the core library migration.
The RollPair / RollTensor POC will start soon.

Adds call sequence number in every call

To ease debugging, a call sequence number needs to be added to each call.
This sequence number should be unique.

Suggest adding it to the gRPC call's metadata to avoid a proto change.

Labeled as v1.x.
In 2.x, consider whether it should be added to the proto file.
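
A minimal sketch of the metadata approach, assuming a Python client. gRPC Python accepts call metadata as a sequence of (key, value) tuples through the metadata keyword of a stub call; the key name x-call-seq and the CallSequence class are hypothetical and chosen only for illustration.

```python
import itertools
import threading
import uuid


class CallSequence:
    """Generates unique call sequence numbers for one client process."""

    def __init__(self):
        # A random prefix keeps ids unique across processes (assumption:
        # uniqueness is needed cluster-wide, not only per process).
        self._prefix = uuid.uuid4().hex
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            return f"{self._prefix}-{next(self._counter)}"

    def metadata(self):
        # gRPC metadata keys must be lowercase; the key name is hypothetical.
        return (("x-call-seq", self.next_id()),)
```

With a real stub, the result would be passed as stub.SomeMethod(request, metadata=seq.metadata()), where SomeMethod stands for any RPC method. The server side can then log the value without any change to the proto definitions.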

Performance optimization of put_all required

Currently put_all is single-threaded, which results in very low data input performance.
Suggest implementing a parallel mechanism, e.g. a multi-threaded or multi-process put_all.
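
One way the multi-threaded variant could look is sketched below. It is not Eggroll's put_all: plain dicts stand in for the per-partition stores, and parallel_put_all and partition_of are hypothetical names.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_of(key, num_partitions):
    """Map a key to a partition index (simple hash partitioning)."""
    return hash(key) % num_partitions


def parallel_put_all(partitions, items, max_workers=4):
    """Write (key, value) items into `partitions`, one worker per partition.

    `partitions` is a list of dicts standing in for partition stores.
    Returns the number of items written.
    """
    # Group items by target partition first.
    buckets = defaultdict(list)
    for key, value in items:
        buckets[partition_of(key, len(partitions))].append((key, value))

    def write(index):
        # Each partition is written by exactly one worker, so no lock is needed.
        partitions[index].update(buckets[index])
        return len(buckets[index])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(write, list(buckets)))
```

Threads help most when each write blocks on network or disk I/O. For CPU-bound serialization in CPython, a process pool would likely be the better fit, at the cost of pickling the batches.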

Stream error occurs when launching a new job

Describe the bug
A stream error occurs when launching a new job in FATE.

What version of Eggroll and what programming language (including its version) are you using?
Python, Eggroll 0.3

What is the severity of this issue and why?
L0: the training job gets stuck, and rebooting does not resolve it.

How to reproduce this issue?
Steps to reproduce the behavior:

  1. Launch a new training job in FATE (it gets stuck).
  2. Go to roll/logs/error.log.
  3. See 'Stream error'.

What did you expect to see?
Everything should run with no errors.

What did you see instead?
Stream error

Could you offer us the error logs or error screenshot?
If applicable, add logs or screenshots to help explain your problem.
[screenshot]

What is your environment information (please complete the following information)?

  • OS: Windows
  • Version: 10

Anything else we should know about your project / environment?

Connection reset by peer when submitting data from concurrent threads

Exception in thread roll_pair-send_command-90f32f70-dba0-11ea-88c4-fa163e1070a0-py-job-93d9f2a0-dba0-11ea-8514-fa163e1070a0_putAll:
Traceback (most recent call last):
File "/data/projects/fate/eggroll/python/eggroll/core/client.py", line 71, in sync_send
response = _command_stub.call(request.to_proto())
File "/data/projects/fate/common/python/venv/lib/python3.6/site-packages/grpc/_channel.py", line 565, in call
return _end_unary_response_blocking(state, call, False, None)
File "/data/projects/fate/common/python/venv/lib/python3.6/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Connection reset by peer"
debug_error_string = "{"created":"@1597129331.341114809","description":"Error received from peer ipv4:xx.xx.xx.xx:32882","file":"src/core/lib/surface/call.cc","file_line":1052,"grpc_message":"Connection reset by peer","grpc_status":14}"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/data/projects/fate/common/miniconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/data/projects/fate/common/miniconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/data/projects/fate/eggroll/python/eggroll/roll_pair/roll_pair.py", line 568, in send_command
serdes_type=SerdesTypes.PROTOBUF)
File "/data/projects/fate/eggroll/python/eggroll/core/client.py", line 54, in simple_sync_send
results = self.sync_send(inputs=[input], output_types=[output_type], endpoint=endpoint, command_uri=command_uri, serdes_type=serdes_type)
File "/data/projects/fate/eggroll/python/eggroll/core/client.py", line 84, in sync_send
raise CommandCallError(command_uri, endpoint, e)
eggroll.core.client.CommandCallError: ('Failed to call command: CommandURI(_uri=v1/roll-pair/runJob) to endpoint: xx.xx.xx.xx:32882, caused by: ', <_Rendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Connection reset by peer"
debug_error_string = "{"created":"@1597129331.341114809","description":"Error received from peer ipv4:xx.xxx.xx.xx:32882","file":"src/core/lib/surface/call.cc","file_line":1052,"grpc_message":"Connection reset by peer","grpc_status":14}"

)

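roll_pair test fails

FATE: 1.4.2
EGGROLL: 2.0.1

Deployed with Docker-Compose from KubeFATE, starting four containers (rollsite, clustermanager, nodemanager, mysql) as a cluster across two hosts.
1. If the images are pulled with docker pull *** and then deployed with Docker-Compose, everything works.
2. But if the Docker images are built offline and then deployed with Docker-Compose, the roll_pair test in section 6.2

python -m unittest test_roll_pair.TestRollPairCluster			-- cluster mode

fails. What is going wrong?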

ERROR: setUpClass (test_roll_pair.TestRollPairCluster)
...
ValueError: processor in session meta is not valid:<ErSessionMeta(id=er_session_py_20200827.----_192.167.0.4, name=, status=ERROR, tag=, processors=[***, len=11], options=[{'eggroll.session.processors.per.node':'10'}]) at 0x...>

Support data import from database and HDFS

The existing mechanism supports data import from CSV files and from memory. Databases and HDFS are also common data sources, so importing directly from them needs to be supported.

Labeled as v1.x, but v2.x also needs this feature.
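
For the database case, a rough sketch of what a direct import source might look like follows. It uses the standard-library sqlite3 module purely as a stand-in for whatever database is targeted; iter_rows and the convention that the first selected column is the key are assumptions, not part of Eggroll.

```python
import sqlite3


def iter_rows(conn, query, batch_size=1000):
    """Yield (key, value) pairs from a query, fetching rows in batches.

    Assumption: the first selected column is the key and the remaining
    columns form the value tuple.
    """
    cursor = conn.execute(query)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            yield row[0], row[1:]
```

Because the function is a generator, rows are streamed in batches rather than loaded at once, so its output could be fed straight into a bulk-put call without holding the whole table in memory.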

Migration of core library from 1.x to 2.x

Migrate the core library from 1.x to 2.x with the following changes:

  1. Embed a manual flow control mechanism in the gRPC framework.
  2. Introduce Scala into the project.
  3. Replace DelayedResult with AwaitSettableFuture to comply with the JDK's Future system.
  4. Provide configuration items wherever possible.
  5. Remove the dependency on the Spring framework.

Importing data from file

Eggroll 1.x supports data import from memory only, so users have to load and process their data themselves before importing it into Eggroll.

We should let users import data directly from a file.
Users can pass in their own split function that returns a (key, value) tuple; the keys and values are then imported into Eggroll.

To be implemented in v1.x first, since the requirement comes from the FATE side, then ported to v2.x later.
Labeled as v1.x.
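
The proposed interface could look roughly like the sketch below. The names import_from_file, split_csv_line and the put callback are hypothetical; the callback stands in for whatever call actually stores a pair in Eggroll.

```python
def import_from_file(path, split_func, put):
    """Read `path` line by line and store each line as a (key, value) pair.

    split_func: user-supplied callable, line -> (key, value)
    put:        callable(key, value) that stores the pair (a stand-in here)
    Returns the number of imported records.
    """
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            key, value = split_func(line)
            put(key, value)
            count += 1
    return count


def split_csv_line(line):
    """Example split function: the first comma-separated field is the key."""
    key, _, value = line.partition(",")
    return key, value
```

A call such as import_from_file("data.csv", split_csv_line, table.__setitem__) with a plain dict named table would load each line keyed by its first field. Leaving the split logic to the caller keeps the importer independent of any particular file format.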

RollFrame POC

A frame-based roll object for computing, storage, and transfer.
Changes:

  • columnar frame format support
  • local threads first
  • concurrent computing within a partition
  • in-memory computing
