cloudburst's Introduction

Cloudburst

Cloudburst is a low-latency, stateful serverless programming framework built on top of the Anna KVS. It lets users execute compositions of functions at low latency, building on Anna to enable stateful computation. Cloudburst is co-deployed with Anna's caching system to achieve low-latency access to shared state, and it relies on Anna's lattice data structures to resolve conflicting updates to that state.

Getting Started

You can install Cloudburst's dependencies with pip and use the bash scripts included in this repository to run the system locally. You can find the Cloudburst client in cloudburst/client/client.py. Full documentation on starting a cluster in local mode can be found here; documentation for the Cloudburst client can be found here. An example interaction is shown below.

$ pip3 install -r requirements.txt
$ ./scripts/start-cloudburst-local.sh n n
...
$ ./scripts/stop-cloudburst-local.sh n

The CloudburstConnection is the main client interface; when running in local mode, all interaction between the client and server happens on localhost. Users can register functions and execute them. Executions return CloudburstFutures, which can be retrieved asynchronously via the get method. Users can also register DAGs (directed, acyclic graphs) of functions, where the results of one function are passed to downstream functions.

>>> from cloudburst.client.client import CloudburstConnection
>>> local_cloud = CloudburstConnection('127.0.0.1', '127.0.0.1', local=True)
>>> cloud_sq = local_cloud.register(lambda _, x: x * x, 'square')
>>> cloud_sq(2).get()
4
>>> local_cloud.register_dag('dag', ['square'], [])
>>> local_cloud.call_dag('dag', { 'square': [2] }).get()
4
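
Results flow along DAG edges. For instance, a two-function chain might look like the sketch below, following the same API (the increment function and the chain DAG are illustrative additions, not from the repository):

>>> cloud_incr = local_cloud.register(lambda _, x: x + 1, 'increment')
>>> local_cloud.register_dag('chain', ['square', 'increment'], [('square', 'increment')])
>>> local_cloud.call_dag('chain', {'square': [3]}).get()
10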

To run Anna and Cloudburst in cluster mode, you will need to use the cluster management setup, which can be found in the hydro-project/cluster repo. Instructions on how to use the cluster management tools can be found in that repo.

License

The Hydro Project is licensed under the Apache v2 License.

cloudburst's People

Contributors

alicelyy, briantliao, cw75, kmindspark, saurav-c, vsreekanti, xcharleslin


cloudburst's Issues

Smarter reference management for the scheduler

Right now, the scheduler deserializes all arguments and looks through them in order to determine which ones are CloudburstReferences and which ones are not. This can be really expensive if deserialization is slow and/or the data is large. Instead, we should elevate these references into the protobufs, so the scheduler can look at them without doing any deserialization. Note that we will still want to keep the references in the arguments themselves, too, because we need to know where to re-insert the references once we resolve them.
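
A minimal sketch of what the proposed scheduling path could look like, assuming a hypothetical repeated references field on the call protobuf (the Executor type and field names here are illustrative, not the actual Cloudburst schema):

from dataclasses import dataclass, field

@dataclass
class Executor:
    address: str
    cached_keys: set = field(default_factory=set)

def pick_executor(call_references, executors):
    # With references lifted into the protobuf, placement is a cheap set
    # intersection: no argument deserialization required.
    keys = set(call_references)
    return max(executors, key=lambda e: len(keys & e.cached_keys))

# Example: pick_executor(['key-a'], [Executor('10.0.0.1', {'key-a'}),
#                                    Executor('10.0.0.2')])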

Getting stuck when creating a CloudburstConnection object

In the README.md, there is:

from cloudburst.client.client import CloudburstConnection
local_cloud = CloudburstConnection('127.0.0.1', '127.0.0.1', local=True)

Running this from the python3 REPL, I get stuck on the CloudburstConnection line.

In the function execution doc, it instructs:

local = True # or False if you are running against a HydroCluster
elb_address = '127.0.0.1' # or the address of the ELB returned by the
from cloudburst.client.client import CloudburstConnection
cloudburst = CloudburstConnection(AWS_FUNCTION_ELB, MY_IP, local=local)

Do you need an ELB for this? Can you run this locally using '127.0.0.1', '127.0.0.1'? Both examples set local=True.

Allow custom serialization techniques

If the user wants to pass in and receive bytestreams, we should allow that. This means that each function would have to be tagged with metadata indicating that it does its own serialization, and as a first cut, the user would have to specify at call time not to apply any of our own serialization. Ideally, when you create or retrieve the function, you would get metadata inside the CloudburstFunction object that automatically tells the client whether or not to serialize.
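
A hedged sketch of what that metadata could look like; the raw_serialization flag and prepare_args helper are a hypothetical design, not part of the current Cloudburst API:

import cloudpickle

class CloudburstFunction:
    def __init__(self, name, raw_serialization=False):
        self.name = name
        # Hypothetical flag: when set, the client forwards user-supplied
        # bytestreams untouched in both directions.
        self.raw_serialization = raw_serialization

    def prepare_args(self, args):
        if self.raw_serialization:
            return list(args)  # the caller already produced bytes
        return [cloudpickle.dumps(a) for a in args]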

DAG for conditional and parallel operations

I am trying to understand how DAGs work for conditional and parallel operators; specifically, consider the design below:

func1 = cb.register(lambda _, x: x * x, 'square')
func2 = cb.register(lambda _, x: x + 1, 'increment')
func3 = cb.register(lambda _, x: x % 2, 'choice')

# based on the value returned by 'choice', I need to call either square or increment
edge1 = ('choice', 'square')
edge2 = ('choice', 'increment')

cb.register_dag('branch', ['choice','square','increment'], [edge1, edge2])
cb.call_dag('branch', { 'choice': random.randint(1,2) }).get()

Though the DAG is correct, I am not getting the expected functionality. For example, when choice returns 0, I want square to be called (and vice versa), not both. Can you please point me to references for achieving this on Cloudburst? I would also like to achieve switch-case functionality using this pattern.
On adding a sink function, it expects 3 parameters (1 dummy and 2, one from each upstream function), which is not what I want, as only one path should execute.
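
For reference, a client-side workaround sketch, reusing the registrations above: run 'choice' on its own, then dispatch to exactly one downstream function. This is an illustration, not a built-in Cloudburst branching primitive.

import random

x = random.randint(1, 2)
branch = func3(x).get()                              # 'choice' returns x % 2
result = (func1 if branch == 0 else func2)(x).get()  # exactly one path runs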

Similarly for fork-join pattern, this is the design

func4 = cb.register(lambda _, x: x, 'fork')
func5 = cb.register(lambda _, x, y: x + y, 'merge')
edge1 = ('fork', 'square')
edge2 = ('fork', 'increment')
edge3 = ('square', 'merge')
edge4 = ('increment', 'merge')

cb.register_dag('f-m', ['fork','square','increment', 'merge'], [edge1, edge2, edge3, edge4])
cb.call_dag('f-m', { 'fork': random.randint(1,2) }).get()

This gives me the expected results. What I wanted to know is whether square and increment execute in parallel, since their upstream function completes at the same time.

External dependency management

We need a way to manage external Python dependencies. Potential designs:

  • Lambda-style zip file with Python dependencies (a minimal sketch appears after this list)
  • Allowing users to run apt-get or pip; this introduces interesting questions about knowing when things are cached where.
  • Allow deployment of custom containers.
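
A minimal sketch of the zip option, purely illustrative (Cloudburst has no such mechanism today; the paths are placeholders):

import sys
import zipfile

def load_dependency_bundle(zip_path, target_dir='/tmp/cloudburst-deps'):
    # Unpack a user-uploaded bundle of packages and put it on sys.path so
    # that functions pinned afterwards can import from it.
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)
    if target_dir not in sys.path:
        sys.path.insert(0, target_dir)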

Questions about code in cloudburst/server/scheduler/policy/default_policy.py

Hi there, hope everyone is doing OK during this hard time of the pandemic. Wish you all the best.
I was following the steps from the tutorial and running the code against the CloudburstConnection client interface when I got this AttributeError. It shows a list object trying to call discard(), which is a method on Python sets.
So it seems that in cloudburst/server/scheduler/policy/default_policy.py, self.function_locations[function_name] should be a set of executors, but somehow at runtime it turned into a list.
I was wondering whether this could be a problem?

local_cloud.register_dag('dag', ['square'], [])
(True, 0)
Traceback (most recent call last):
File "cloudburst/server/scheduler/server.py", line 350, in
sched_conf['policy'])
File "cloudburst/server/scheduler/server.py", line 231, in scheduler
policy.process_status(status)
File "/Users/cosmo/Desktop/Files/research/cloudburst/cloudburst/server/scheduler/policy/default_policy.py", line 377, in process_status
self.function_locations[function_name].discard(key)
AttributeError: 'list' object has no attribute 'discard'
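
If the diagnosis above is right, one plausible fix is to keep the value type a set everywhere, for example (a guess at the fix, not the patch actually applied upstream):

from collections import defaultdict

# Constructing the map this way guarantees discard() is always available.
function_locations = defaultdict(set)
function_locations['square'].add(('127.0.0.1', 0))
function_locations['square'].discard(('127.0.0.1', 0))  # no AttributeError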

Cloudburst connection to Anna running locally

Hi team,
I have set up Cloudburst to run the provided benchmarks, with everything running on my local machine. My goal is to run the benchmarks available under cloudburst/server locally. Unfortunately, I have been unable to figure out what is missing to get responses from Anna.

After I start a benchmark, for example composition.py, it runs fine until the functions are registered, but it never returns (or times out) after that, when it tries to call get() from Anna.

I suspect the issue is with the port configuration, which I did not change; all of the ports that Anna or Cloudburst possibly use by default are free.
Also the ones Cloudburst needs here.
I am also able to run the example given in the main README of Cloudburst just fine.

My configurations:
Anna seems to be running fine locally, and I can GET and PUT in the KVS.

~/sites » ps | grep anna     
3162 ttys005  116:36.56 ./build/target/kvs/anna-monitor
3163 ttys005    0:07.80 ./build/target/kvs/anna-route
3164 ttys005  116:44.17 ./build/target/kvs/anna-kvs

Cloudburst executor and scheduler are also running

~/sites » ps ax | grep python 
 4638   ??  S      0:01.84 python3 cloudburst/server/scheduler/server.py conf/cloudburst-local.yml
 4639   ??  S      0:01.85 python3 cloudburst/server/executor/server.py conf/cloudburst-local.yml

I can also post recent logs from anna and cloudburst if that helps but there was no error or info message.

I wonder what I am missing to get results from the benchmarks. Is there already documentation about connecting the two? I would be happy to improve the documentation once the issue is resolved :)

pip install -r requirements.txt vs. pip3 install -r requirements.txt

I'm following the getting started guide, which uses pip install -r requirements.txt. That command fails for me, but pip3 install -r requirements.txt works.

This is on a new Ubuntu 18.04 LTS EC2 t2.micro instance.

pip install -r requirements.txt
... # downloads first few python packages

Could not find a version that satisfies the requirement pandas==0.25.1 (from -r requirements.txt (line 5)) (from versions: 0.1, 0.2b0, 0.2b1, 0.2, 0.3.0b0, 0.3.0b2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0rc1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0rc1, 0.8.0rc2, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0rc1, 0.19.0, 0.19.1, 0.19.2, 0.20.0rc1, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0rc1, 0.21.0, 0.21.1, 0.22.0, 0.23.0rc2, 0.23.0, 0.23.1, 0.23.2, 0.23.3, 0.23.4, 0.24.0rc1, 0.24.0, 0.24.1, 0.24.2)
No matching distribution found for pandas==0.25.1 (from -r requirements.txt (line 5))

Function executor container crashes

I am trying to run Cloudburst in cluster mode on AWS following the Getting Started Guide (in mesh networking mode without a domain), but one function container seems to be caught in a crash-loop.

The logs of container function-1 say that the address is already in use:

Copying flow.egg-info to /usr/local/lib/python3.6/dist-packages/flow-0.1.0-py3.6.egg-info
running install_scripts
Traceback (most recent call last):
  File "cloudburst/server/executor/server.py", line 497, in <module>
    int(exec_conf['thread_id']))
  File "cloudburst/server/executor/server.py", line 59, in executor
    pin_socket.bind(sutils.BIND_ADDR_TEMPLATE % (sutils.PIN_PORT + thread_id))
  File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use

I can run functions on the remaining function executors without any problem, but when I try to register a DAG (the one from the example), the scheduler container crashes (see the error below). This seems to happen because there are no pinning candidates, which could be caused by the function container that never started.

Traceback (most recent call last):
  File "cloudburst/server/scheduler/server.py", line 346, in <module>
    scheduler(conf['ip'], conf['mgmt_ip'], sched_conf['routing_address'])
  File "cloudburst/server/scheduler/server.py", line 181, in scheduler
    call_frequency)
  File "/hydro/cloudburst/cloudburst/server/scheduler/create.py", line 86, in create_dag
    success = policy.pin_function(dag.name, fref, colocated)
  File "/hydro/cloudburst/cloudburst/server/scheduler/policy/default_policy.py", line 249, in pin_function
    node, tid = sys_random.sample(candidates, 1)[0]
  File "/usr/lib/python3.6/random.py", line 320, in sample
    raise ValueError("Sample larger than population or is negative")
ValueError: Sample larger than population or is negative

What am I doing wrong?

Easily enable local mode

There are currently two or three places where we have to change code in order to be able to run Cloudburst in local mode. In particular, this requires relaxing the scheduling and executor constraints on the number of pinned functions per thread.

@saurav-c, I know you've made these changes before, so can you wrap those changes in an if statement that relies on a local mode flag (specified in the config YAML file), so we don't have to change the code to run locally? Let me know if you have any questions.
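
A sketch of the requested change, assuming a local key in the config YAML (the key name and the cap values are assumptions, not the final implementation):

import yaml

with open('conf/cloudburst-local.yml') as f:
    conf = yaml.safe_load(f)

# In local mode, lift the per-thread pinning cap so a single executor
# process can host every registered function.
local = bool(conf.get('local', False))
max_pinned_functions = float('inf') if local else 1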

Update dependencies

Working with this project is 90% fighting old dependencies. For instance, right now I'm trying to run a client in a FastAPI Docker container and I ran into this obscure issue: protocolbuffers/protobuf#5435.

Previously I've had to:

  • Figure out which Python version this project was developed for (because pyarrow 0.14 is broken on newer versions)
  • Discover pyenv
  • Install and configure pyenv
  • Realise that protoc and protobuf are two different things
  • Change the protoc version in common/scripts/install-dependencies.sh to match protobuf
  • Remove the preinstalled protoc version in the hydroproject/base docker image such that my modified install-dependencies.sh would install the correct version

Having Trouble with Clustered Cloudburst

Hello!

I've been trying to get Cloudburst working in cluster mode to reproduce some of its results (particularly Figures 11-12 from the VLDB paper) and I'm having some trouble. I'm running all experiments on AWS in us-east-1 on an Ubuntu 20.04 instance. I created a Hydro cluster successfully using the suggested command:

python3 -m hydro.cluster.create_cluster -m 1 -r 1 -f 1 -s 1

I then tried to connect to the cluster with Cloudburst. Installing the Cloudburst Python requirements didn't work (Python 3.8.10, pip 20.0.2): the Pandas installation failed, and the required PyArrow version (0.14.1) doesn't exist anymore. To fix this, I used the most recent versions of Pandas and PyArrow, which both installed successfully. I installed all other dependencies (including Anna) successfully using the provided scripts. I then tried to connect to the cluster and perform a simple operation using the suggested commands:

>>> elb_address = "a91cbd914819441efb3414f96e794c2a-891365270.us-east-1.elb.amazonaws.com" # the function service IP from the hydro cluster
>>> my_ip = "172.31.70.53" # the internal IP address of my EC2 instance
>>> from cloudburst.client.client import CloudburstConnection
>>> cloudburst = CloudburstConnection(elb_address, my_ip, local=False)
>>> incr = lambda _, a: a + 1
>>> cloud_incr = cloudburst.register(incr, 'incr')
>>> cloud_incr(1).get()

The first few commands all work, but the final command hangs. Here's a stack trace of where it's hanging:

File "<stdin>", line 1, in <module>
File "/home/ubuntu/cloudburst/cloudburst/shared/future.py", line 23, in get
  obj = self.kvs_client.get(self.obj_id)[self.obj_id]
File "/usr/local/lib/python3.8/dist-packages/anna/client.py", line 85, in get
  worker_addresses[key] = (self._get_worker_address(key))
File "/usr/local/lib/python3.8/dist-packages/anna/client.py", line 247, in _get_worker_address
  addresses = self._query_routing(key, port)
File "/usr/local/lib/python3.8/dist-packages/anna/client.py", line 278, in _query_routing
  response = recv_response([key_request.request_id],
File "/usr/local/lib/python3.8/dist-packages/anna/zmq_util.py", line 27, in recv_response
  resp = rcv_sock.recv()
File "zmq/backend/cython/socket.pyx", line 788, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 824, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 186, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/checkrc.pxd", line 12, in zmq.backend.cython.checkrc._check_rc

Do you know what's causing the problem and how to fix it? Thank you!

Some doubts about fref batching=True and fref.type=MULTIEXEC

I have some doubts about fref batching=True and fref.type=MULTIEXEC. In executor/server.py, when the executor receives a message on exec_dag_socket with batching=True, multiple pieces of trigger information are obtained. If the type of fref is MULTIEXEC, the code assigns [trigger] to triggers; but trigger here is the last value assigned in the previous loop, and it is not updated in the subsequent loop, so the same trigger value is used for every call. I wonder if there is a problem with my understanding? I pasted the code snippet below. Thanks for your help.

for key in trigger_keys:
    # print(key)
    # print(received_triggers[key])
    # print(trigger)
    if (len(received_triggers[key]) == len(schedule.triggers)) or fref.type == MULTIEXEC:

        if fref.type == MULTIEXEC:
            triggers = [trigger]
        else:
            triggers = list(received_triggers[key].values())
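
If the reading above is correct, one possible fix is to select the trigger(s) recorded for the current key rather than the loop-stale trigger variable. A sketch (names mirror the snippet; this is not the upstream patch):

MULTIEXEC = 'MULTIEXEC'  # stand-in for the protobuf enum value

def select_triggers(fref_type, key, received_triggers, expected_count):
    per_key = received_triggers[key]   # dict: source function -> trigger
    if fref_type == MULTIEXEC:
        return list(per_key.values())  # this key's own trigger(s)
    if len(per_key) == expected_count: # expected_count = len(schedule.triggers)
        return list(per_key.values())
    return None                        # still waiting on upstream triggers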

Function execution extremely slow within first minute of pod re-creation

After having to delete and remake the pods, I tried running a benchmark as soon as they were ready.

These logs are one second per line. (Usually 100 users takes 2 seconds.)

INFO:root:Making 100 users...                                                                                                     
INFO:root:Currently making user 2.                                                                             
INFO:root:Currently making user 3.                                                                                          
INFO:root:Currently making user 4.                                                                                                     
INFO:root:Currently making user 5.                                                                              
INFO:root:Currently making user 6.                                                                                                
INFO:root:Currently making user 7.                                                                                                     
INFO:root:Currently making user 93.                                                                                
INFO:root:Doing 10 follows per user...                                                                                                 
INFO:root:Currently at 13 following 4th target.                                                                                     
INFO:root:Currently at 26 following 2th target.                                                                                
INFO:root:Currently at 38 following 6th target.                                                                             
INFO:root:Currently at 50 following 6th target.                                                                                     
INFO:root:Currently at 63 following 0th target.                                                                             
INFO:root:Currently at 75 following 7th target.                                                                                       
INFO:root:ccc_user_follow(79, 58) -> ('ERROR', b'\x18\x01')  

Note also that the user follow fails for some reason; this has never happened in tens of thousands of executions of this function in previous benchmarks.

Function nodes failed to start.

I set up the cluster by following the hydro-cluster docs. The function nodes failed to start. Here are the logs:

[ec2-user@ip-172-31-24-229 hydro-project]$ kubectl get pods
NAME                    READY   STATUS             RESTARTS        AGE
function-nodes-s7r2k    1/4     CrashLoopBackOff   785 (18s ago)   21h
management-pod          1/1     Running            0               22h
memory-nodes-tvxjn      1/1     Running            1 (22h ago)     22h
monitoring-pod          1/1     Running            0               22h
routing-nodes-wjdch     1/1     Running            1 (22h ago)     22h
scheduler-nodes-hwk8m   1/1     Running            0               22h

[ec2-user@ip-172-31-24-229 hydro-project]$ kubectl logs function-nodes-s7r2k
Defaulted container "function-1" out of: function-1, function-2, function-3, cache-container
eth0: error fetching interface information: Device not found
From https://github.com/hydro-project/anna
 * [new branch]      master     -> origin/master
Switched to a new branch 'brnch'
Branch 'brnch' set up to track remote branch 'master' from 'origin'.
Synchronizing submodule url for 'common'
running install
running build
running build_py
creating build
creating build/lib
creating build/lib/anna
copying anna/__init__.py -> build/lib/anna
copying anna/lattices.py -> build/lib/anna
copying anna/base_client.py -> build/lib/anna
copying anna/client.py -> build/lib/anna
copying anna/zmq_util.py -> build/lib/anna
copying anna/common.py -> build/lib/anna
copying anna/cloudburst_pb2.py -> build/lib/anna
copying anna/causal_pb2.py -> build/lib/anna
copying anna/anna_pb2.py -> build/lib/anna
copying anna/shared_pb2.py -> build/lib/anna
running install_lib
copying build/lib/anna/cloudburst_pb2.py -> /usr/local/lib/python3.6/dist-packages/anna
copying build/lib/anna/causal_pb2.py -> /usr/local/lib/python3.6/dist-packages/anna
copying build/lib/anna/anna_pb2.py -> /usr/local/lib/python3.6/dist-packages/anna
copying build/lib/anna/shared_pb2.py -> /usr/local/lib/python3.6/dist-packages/anna
byte-compiling /usr/local/lib/python3.6/dist-packages/anna/cloudburst_pb2.py to cloudburst_pb2.cpython-36.pyc
byte-compiling /usr/local/lib/python3.6/dist-packages/anna/causal_pb2.py to causal_pb2.cpython-36.pyc
byte-compiling /usr/local/lib/python3.6/dist-packages/anna/anna_pb2.py to anna_pb2.cpython-36.pyc
byte-compiling /usr/local/lib/python3.6/dist-packages/anna/shared_pb2.py to shared_pb2.cpython-36.pyc
running install_egg_info
running egg_info
creating Anna.egg-info
writing Anna.egg-info/PKG-INFO
writing dependency_links to Anna.egg-info/dependency_links.txt
writing requirements to Anna.egg-info/requires.txt
writing top-level names to Anna.egg-info/top_level.txt
writing manifest file 'Anna.egg-info/SOURCES.txt'
reading manifest file 'Anna.egg-info/SOURCES.txt'
writing manifest file 'Anna.egg-info/SOURCES.txt'
removing '/usr/local/lib/python3.6/dist-packages/Anna-0.1-py3.6.egg-info' (and everything under it)
Copying Anna.egg-info to /usr/local/lib/python3.6/dist-packages/Anna-0.1-py3.6.egg-info
running install_scripts
From https://github.com/hydro-project/cloudburst
 * [new branch]      aft-support -> origin/aft-support
 * [new branch]      dependabot/pip/protobuf-3.15.0 -> origin/dependabot/pip/protobuf-3.15.0
 * [new branch]      dependabot/pip/pyyaml-5.4 -> origin/dependabot/pip/pyyaml-5.4
 * [new branch]      master      -> origin/master
Switched to a new branch 'brnch'
Branch 'brnch' set up to track remote branch 'master' from 'origin'.
Synchronizing submodule url for 'common'
Traceback (most recent call last):
  File "cloudburst/server/executor/server.py", line 504, in <module>
    int(exec_conf['thread_id']))
  File "cloudburst/server/executor/server.py", line 104, in executor
    status.ip = ip
TypeError: None has type NoneType, but expected one of: bytes, unicode

Function still cannot return empty list

When I write a function that returns an empty list, it seems like the return value is never stored in the KVS, causing the .get() to hang forever.

>>> res.get()
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/.local/lib/python3.6/site-packages/droplet/shared/future.py", line 26, in get
    obj = self.kvs_client.get(self.obj_id)[self.obj_id]
  File "/home/ubuntu/.local/lib/python3.6/site-packages/anna/client.py", line 106, in get
    KeyResponse)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/anna/zmq_util.py", line 27, in recv_response
    resp = rcv_sock.recv()
  File "zmq/backend/cython/socket.pyx", line 788, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 824, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 186, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 12, in zmq.backend.cython.checkrc._check_rc
KeyboardInterrupt

>>> res.kvs_client.get(res.obj_id)
{'2e1d6152-5f96-4f18-a6e8-bfc60e79a0c5': None}

(This is distinct from the last version of this bug, where we were checking for "not obj" client-side instead of explicitly checking for None.)
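
A hedged guess at the failure mode: a truthiness check on the result before storing it would skip falsy-but-valid values such as []. An explicit None comparison avoids that (illustrative, not the actual executor code):

def store_result(kvs_client, obj_id, result, serialize):
    # `if result:` would silently drop [], 0, and ''; only None should mean
    # "nothing to store".
    if result is not None:
        kvs_client.put(obj_id, serialize(result))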

pip3 install -r requirements.txt : Could not find a version that satisfies the requirement pyarrow==0.14.1

Hello!

I am currently trying to set up Cloudburst, but when installing the dependencies via pip3 install -r requirements.txt, the following happens:

sudo pip3 install -r requirements.txt
Collecting cloudpickle==0.6.1
Downloading cloudpickle-0.6.1-py2.py3-none-any.whl (14 kB)
Collecting coverage==4.5.4
Downloading coverage-4.5.4.tar.gz (385 kB)
|████████████████████████████████| 385 kB 1.0 MB/s
Collecting flake8==3.7.7
Downloading flake8-3.7.7-py2.py3-none-any.whl (68 kB)
|████████████████████████████████| 68 kB 2.1 MB/s
Collecting numpy==1.16.1
Downloading numpy-1.16.1.zip (5.1 MB)
|████████████████████████████████| 5.1 MB 1.1 MB/s
Collecting pandas==0.25.1
Downloading pandas-0.25.1.tar.gz (12.6 MB)
|████████████████████████████████| 12.6 MB 1.6 MB/s
Collecting protobuf==3.6.1
Downloading protobuf-3.6.1-py2.py3-none-any.whl (390 kB)
|████████████████████████████████| 390 kB 2.7 MB/s
ERROR: Could not find a version that satisfies the requirement pyarrow==0.14.1 (from -r requirements.txt (line 7)) (from versions: 0.9.0, 0.10.0, 0.11.0, 0.11.1, 0.12.0, 0.12.1, 0.13.0, 0.14.0, 0.15.1, 0.16.0, 0.17.0, 0.17.1, 1.0.0, 1.0.1, 2.0.0, 3.0.0, 4.0.0, 4.0.1, 5.0.0, 6.0.0, 6.0.1, 7.0.0, 8.0.0)
ERROR: No matching distribution found for pyarrow==0.14.1 (from -r requirements.txt (line 7))

I am assuming this happens because pyarrow 0.14.1 is pinned alongside protobuf 3.6.1 but is no longer available on PyPI:

sudo pip install pyarrow==0.14.1
ERROR: Could not find a version that satisfies the requirement pyarrow==0.14.1 (from versions: 0.9.0, 0.10.0, 0.11.0, 0.11.1, 0.12.0, 0.12.1, 0.13.0, 0.14.0, 0.15.1, 0.16.0, 0.17.0, 0.17.1, 1.0.0, 1.0.1, 2.0.0, 3.0.0, 4.0.0, 4.0.1, 5.0.0, 6.0.0, 6.0.1, 7.0.0, 8.0.0)
ERROR: No matching distribution found for pyarrow==0.14.1

Can this be fixed by simply using a newer version of protobuf?

Best regards,
Florian

Fresh clone doesn't build on OS X

I'm struggling to build and run cloudburst from a fresh clone of the repo.

I have protobuf installed using brew install protobuf.

Build:

$ ./scripts/build.sh
common/proto: warning: directory does not exist.
Could not make proto path relative: cloudburst.proto: No such file or directory
common/proto: warning: directory does not exist.
Could not make proto path relative: anna.proto: No such file or directory
sed: -i may not be used with stdin
sed: -i may not be used with stdin
sed: -i may not be used with stdin

Note: the common directory exists, but there's nothing in it.

Run:

$ pip3 install -r requirements.txt
Requirement already satisfied: cloudpickle==0.6.1 in /usr/local/lib/python3.7/site-packages (from -r requirements.txt (line 1)) (0.6.1)
Requirement already satisfied: coverage==4.5.4 in /usr/local/lib/python3.7/site-packages (from -r requirements.txt (line 2)) (4.5.4)
Requirement already satisfied: flake8==3.7.7 in /usr/local/lib/python3.7/site-packages (from -r requirements.txt (line 3)) (3.7.7)
Requirement already satisfied: numpy==1.16.1 in /usr/local/lib/python3.7/site-packages (from -r requirements.txt (line 4)) (1.16.1)
Requirement already satisfied: pandas==0.25.1 in /usr/local/lib/python3.7/site-packages (from -r requirements.txt (line 5)) (0.25.1)
Requirement already satisfied: protobuf==3.6.1 in /usr/local/lib/python3.7/site-packages (from -r requirements.txt (line 6)) (3.6.1)
Requirement already satisfied: pyarrow==0.14.1 in /usr/local/lib/python3.7/site-packages (from -r requirements.txt (line 7)) (0.14.1)
Requirement already satisfied: pycodestyle==2.5.0 in /usr/local/lib/python3.7/site-packages (from -r requirements.txt (line 8)) (2.5.0)
Requirement already satisfied: PyYAML==5.1.2 in /usr/local/lib/python3.7/site-packages (from -r requirements.txt (line 9)) (5.1.2)
Requirement already satisfied: pyzmq==17.1.2 in /usr/local/lib/python3.7/site-packages (from -r requirements.txt (line 10)) (17.1.2)
Requirement already satisfied: zmq==0.0.0 in /usr/local/lib/python3.7/site-packages (from -r requirements.txt (line 11)) (0.0.0)
Requirement already satisfied: setuptools==41.0.1 in /usr/local/lib/python3.7/site-packages (from -r requirements.txt (line 12)) (41.0.1)
Requirement already satisfied: mccabe<0.7.0,>=0.6.0 in /usr/local/lib/python3.7/site-packages (from flake8==3.7.7->-r requirements.txt (line 3)) (0.6.1)
Requirement already satisfied: entrypoints<0.4.0,>=0.3.0 in /usr/local/lib/python3.7/site-packages (from flake8==3.7.7->-r requirements.txt (line 3)) (0.3)
Requirement already satisfied: pyflakes<2.2.0,>=2.1.0 in /usr/local/lib/python3.7/site-packages (from flake8==3.7.7->-r requirements.txt (line 3)) (2.1.1)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.7/site-packages (from pandas==0.25.1->-r requirements.txt (line 5)) (2.7.3)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/site-packages (from pandas==0.25.1->-r requirements.txt (line 5)) (2019.3)
Requirement already satisfied: six>=1.9 in /usr/local/lib/python3.7/site-packages (from protobuf==3.6.1->-r requirements.txt (line 6)) (1.11.0)

According to Getting Started:

$ ./scripts/start-cloudburst-local.sh
Usage: ././scripts/start-cloudburst-local.sh build

You must run this from the project root directory.

Ok, then:

$ ./scripts/start-cloudburst-local.sh build
Traceback (most recent call last):
  File "cloudburst/server/executor/server.py", line 20, in <module>
    from anna.client import AnnaTcpClient
ModuleNotFoundError: No module named 'anna'

This is on OS X Mojave. Hope this helps. Have you considered a Makefile with targets such as make, make run-local, etc.?

Creating a cluster produces an error

python3 -m hydro.cluster.create_cluster -m 1 -r 1 -f 1 -s 1
Creating cluster object...
Adding general instance group
Creating cluster on AWS...
Validating cluster...
Creating management pods...
Creating 1 routing nodes...
Adding 1 routing server node(s) to cluster...
Validating cluster...
Creating 1 memory, 0 ebs node(s)...
Adding 1 memory server node(s) to cluster...
Adding 0 ebs server node(s) to cluster...
Validating cluster...
Creating routing service...
Adding 1 scheduler nodes...
Adding 1 scheduler server node(s) to cluster...
Validating cluster...
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/websocket/_http.py", line 143, in _get_addrinfo_list
    hostname, port, 0, 0, socket.SOL_TCP)
  File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/stream/ws_client.py", line 253, in websocket_call
    client = WSClient(configuration, get_websocket_url(url), headers)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/stream/ws_client.py", line 76, in __init__
    self.sock.connect(url, header=header)
  File "/usr/local/lib/python3.6/dist-packages/websocket/_core.py", line 223, in connect
    options.pop('socket', None))
  File "/usr/local/lib/python3.6/dist-packages/websocket/_http.py", line 113, in connect
    hostname, port, is_secure, proxy)
  File "/usr/local/lib/python3.6/dist-packages/websocket/_http.py", line 154, in _get_addrinfo_list
    raise WebSocketAddressException(e)
websocket._exceptions.WebSocketAddressException: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/hydro-project/cluster/hydro/cluster/create_cluster.py", line 228, in <module>
    aws_key_id, aws_key)
  File "/home/ubuntu/hydro-project/cluster/hydro/cluster/create_cluster.py", line 124, in create_cluster
    BATCH_SIZE, prefix)
  File "/home/ubuntu/hydro-project/cluster/hydro/cluster/add_nodes.py", line 129, in batch_add_nodes
    prefix)
  File "/home/ubuntu/hydro-project/cluster/hydro/cluster/add_nodes.py", line 116, in add_nodes
    '/hydro/anna/conf/', cname)
  File "/home/ubuntu/hydro-project/cluster/hydro/shared/util.py", line 140, in copy_file_to_pod
    _preload_content=False, container=container)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/stream/stream.py", line 36, in stream
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/apis/core_v1_api.py", line 835, in connect_get_namespaced_pod_exec
    (data) = self.connect_get_namespaced_pod_exec_with_http_info(name, namespace, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/apis/core_v1_api.py", line 935, in connect_get_namespaced_pod_exec_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 334, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 168, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/stream/stream.py", line 31, in _intercept_request_call
    return ws_client.websocket_call(config, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/kubernetes/stream/ws_client.py", line 259, in websocket_call
    raise ApiException(status=0, reason=str(e))
kubernetes.client.rest.ApiException: (0)
Reason: [Errno -2] Name or service not known

Running kubectl get pods shows that the scheduler pod has crashed. The last few lines of its logs are:

Traceback (most recent call last):
  File "cloudburst/server/scheduler/server.py", line 350, in <module>
    sched_conf['policy'])
KeyError: 'policy'

ZMQError (Resource temporarily unavailable) when creating a CloudburstConnection

Hello, Cloudburst is a good project! I want to run a test in local mode but hit an error when creating a CloudburstConnection. I added some log statements in cloudburst/client/client.py; the code is as follows:

def _connect(self):
    sckt = self.context.socket(zmq.REQ)
    sckt.setsockopt(zmq.RCVTIMEO, 1000)
    sckt.connect(self.service_addr % CONNECT_PORT)
    sckt.send_string('')

    print("before-try")

    try:
        print("try")
        result = sckt.recv_string()
        print("after-try")
        return result

    except zmq.ZMQError as e:
        print("zmqerro")
        print(e)
        if e.errno == zmq.EAGAIN:
            return None
        else:
            raise e
The test result:
$ python3                   
Python 3.7.0 (default, Mar  4 2022, 02:48:37) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from cloudburst.client.client import CloudburstConnection
>>> local_cloud = CloudburstConnection('127.0.0.1', '127.0.0.1', local=True)
before-try
try
zmqerro
Resource temporarily unavailable
Connection timed out, retrying
before-try
try
zmqerro
Resource temporarily unavailable
Connection timed out, retrying
before-try
try
zmqerro
Resource temporarily unavailable
Connection timed out, retrying
before-try
try
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/chunpu/cloudburst/cloudburst/client/client.py", line 66, in __init__
    kvs_addr = self._connect()
  File "/home/chunpu/cloudburst/cloudburst/client/client.py", line 341, in _connect
    result = sckt.recv_string()
  File "/home/chunpu/.pyenv/versions/3.7.0/lib/python3.7/site-packages/zmq/sugar/socket.py", line 584, in recv_string
    msg = self.recv(flags=flags)
  File "zmq/backend/cython/socket.pyx", line 788, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 824, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 186, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 12, in zmq.backend.cython.checkrc._check_rc
KeyboardInterrupt
Port 5000 is open:
$ lsof -i:5000
COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
python3 54472 chunpu   29u  IPv4 74598688      0t0  TCP *:5000 (LISTEN)
And the server is running:
$ ps -ef | grep cloud  
chunpu   54472     1  0 11:22 pts/20   00:00:06 /home/chunpu/.pyenv/versions/3.7.0/bin/python3 cloudburst/server/scheduler/server.py conf/cloudburst-local.yml
chunpu   54473     1  0 11:22 pts/20   00:00:06 /home/chunpu/.pyenv/versions/3.7.0/bin/python3 cloudburst/server/executor/server.py conf/cloudburst-local.yml
chunpu   61251 52289  0 11:48 pts/19   00:00:00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox cloud

So, what is the reason for the ZMQError (Resource temporarily unavailable)? I have been confused about it for a long time. Can you help me?
Thank you!
