
Datarevenue Code Challenge

Congratulations on making it to the Data Revenue Code Challenge 2020. This coding challenge will be used to evaluate your technical as well as your communication skills.

You will need docker and docker-compose to run this repository.

Goals

The repository you see here is a minimal local version of our usual task orchestration pipeline. We run everything in docker containers, so each task must expose its functionality via a CLI. We then use luigi to spin up the containers and pass the necessary arguments to each container. See more details here.

The repository already comes with a leaf task implemented which will download the data set for you.

The goal of this challenge is to implement a complete machine learning pipeline. This pipeline should build a proof of concept machine learning model and evaluate it on a test data set.

An important part of this challenge is to assess and explain the model to a fictional client with limited statistical knowledge. So your evaluation should include some plots showing how your model makes its predictions. Finally, you need to give an assessment of whether it makes sense for the client to implement this model!

Challenge

To put things into the right perspective consider the following fictional scenario:

You are an AI Consultant at Data Revenue. One of our clients is a big online wine seller. After a successful strategic consulting engagement, we advise the client to optimize their portfolio by creating a rating predictor (predicting the points given to a wine) for their inventory. We receive a sample dataset (10k rows) from the client and will come back in a week to evaluate our model on a bigger data set that is only accessible from on-premise servers (>100k rows).

The task is to show that a good prediction is possible and thereby make it less risky to implement a full production solution. Our mini pipeline should later be able to run on their on-premise machine, which has only docker and docker-compose installed.

Data set

Here is an excerpt of the dataset you will be working on:

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
Italy Fragrances suggest hay, crushed tomato vine and exotic fruit. The bright but structured palate delivers peach, papaya, cantaloupe and energizing mineral notes alongside fresh acidity. It's nicely balanced with good length, Kirchleiten 90 30.0 Northeastern Italy Alto Adige Kerin O’Keefe @kerinokeefe Tiefenbrunner 2012 Kirchleiten Sauvignon (Alto Adige) Sauvignon Tiefenbrunner
France Packed with fruit and crisp acidity, this is a bright, light and perfumed wine. Red-berry flavors are lifted by red currants and a light spice. Drink now for total freshness. 87 22.0 Loire Valley Sancerre Roger Voss @vossroger Bernard Reverdy et Fils 2014 Rosé (Sancerre) Rosé Bernard Reverdy et Fils
Italy This easy, ruby-red wine displays fresh berry flavors and a light, crisp mouthfeel. Pair this no-fuss wine with homemade pasta sauce or potato gnocchi and cheese. 86 Tuscany Chianti Classico Dievole 2009 Chianti Classico Sangiovese Dievole
US Pretty in violet and rose petals this is a lower-octane Pinot Noir for the winery. Exquisitely rendered in spicy dark cherry and soft, supple tannins, it hails from a cool, coastal vineyard site 1,000 feet atop Occidental Ridge, the coolest source of grapes for Davis. Horseshoe Bend Vineyard 92 50.0 California Russian River Valley Sonoma Virginie Boone @vboone Davis Family 2012 Horseshoe Bend Vineyard Pinot Noir (Russian River Valley) Pinot Noir Davis Family
US This golden wine confounds in a mix of wet stone and caramel on the nose, the body creamy in vanilla. Fuller in style and body than some, it remains balanced in acidity and tangy citrus, maintaining a freshness and brightness throughout. The finish is intense with more of that citrus, plus an accent of ginger and lemongrass. Dutton Ranch 93 38.0 California Russian River Valley Sonoma Virginie Boone @vboone Dutton-Goldfield 2013 Dutton Ranch Chardonnay (Russian River Valley) Chardonnay Dutton-Goldfield
US This is a lush, rich Chardonnay with especially ripe pineapple, peach and lime flavors, as well as a coating of oaky, buttered toast. Signature Selection 84 14.0 California Dry Creek Valley Sonoma Pedroncelli 2012 Signature Selection Chardonnay (Dry Creek Valley) Chardonnay Pedroncelli
US Intensely aromatic of exotic spice, potpourri and dried fig, this dry Gewürztraminer is a bit atypical, but thought provoking and enjoyable. Lemon and apple flavors have a slightly yeasty tone, but brisk acidity and puckering tea-leaf tannins lend elegance and balance. Spezia 87 25.0 New York North Fork of Long Island Long Island Anna Lee C. Iijima Anthony Nappa 2013 Spezia Gewürztraminer (North Fork of Long Island) Gewürztraminer Anthony Nappa
US Dry, acidic and tannic, in the manner of a young Barbera, but the flavors of cherries, blackberries and currants aren't powerful enough to outlast the astringency. Drink this tough, rustic wine now. 84 35.0 California Paso Robles Central Coast Eagle Castle 2007 Barbera (Paso Robles) Barbera Eagle Castle
France Gold in color, this is a wine with notes of spice, rich fruit and honey, which are all surrounded by intense botrytis. This is a wine that has great aging potential, and its superripeness develops slowly on the palate. 94 Bordeaux Sauternes Roger Voss @vossroger Château Lamothe Guignard 2009 Sauternes Bordeaux-style White Blend Château Lamothe Guignard
France Steel and nervy mineralogy are the hallmarks of this wine at this stage. It's still waiting for the fruit to develop, but expect crisp citrus and succulent apples. The aftertaste, tensely fresh now, should soften as the wine develops. This 90% Sauvignon Blanc and 10% Sémillon blend comes from the estate's small vineyard on the slope near Cadillac. 88 12.0 Bordeaux Bordeaux Blanc Roger Voss @vossroger Château Boisson 2014 Bordeaux Blanc Bordeaux-style White Blend Château Boisson

Prerequisites

Before starting this challenge you should:

  1. Know how to train and evaluate an ML model.
  2. Have a solid understanding of the pandas library and ideally the dask parallel computing library.
  3. Know how to run docker containers.
  4. Know how to specify tasks and dependencies in Spotify's luigi.
  5. Have read our TaC blogpost. This will be very helpful for understanding this repo's architecture!

Requirements

To specify the requirements better, let's break this down into individual tasks.

1. DownloadData

We've got you covered and have already implemented this task for you.

2. Make(Train|Test)Dataset

We supply you with the scaffold for this task, so you can start exploring dask or simply go ahead with your usual pandas script.

Read the csv provided by DownloadData and transform it into a numerical matrix ready for your ML models.

Be aware that the dataset is just a sample from the whole dataset so the values in your columns might not represent all possible values.

At Data Revenue we use dask to parallelize pandas operations, so we also include a running dask cluster which you can (but don't have to) use. Remember to partition your csv if you plan on using dask (by using blocksize).

Don't forget to split your data set according to best practices. So you might need more than a single task for this.
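A minimal sketch of such a split, assuming plain pandas (the helper name, test fraction, and seed are illustrative, not part of the scaffold):

```python
import pandas as pd

def make_datasets(csv_path, test_frac=0.2, seed=42):
    """Split the raw wine CSV into disjoint train and test frames.

    Hypothetical helper: in the containerized pipeline each split
    would be written to its own output file under the data directory.
    """
    df = pd.read_csv(csv_path)
    # Hold out a random fraction for evaluation; fixing the seed
    # keeps the split reproducible across pipeline runs.
    test = df.sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)
    return train, test
```

With dask you would instead call `dask.dataframe.read_csv` with a `blocksize` so the file is partitioned, but the splitting logic stays conceptually the same.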

3. TrainModel

Choose a suitable model type and train it on your previously built data set. We like models that don't take forever to train. Please no DNNs (this includes word2vec). For the sake of simplicity you can use fixed hyperparameters (hopefully "hand-tuned"). Serialize your model to a file. If necessary, this file can include metadata.

The final data set will have more than 100k rows.
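Serializing the model together with metadata could be sketched like this, using a trivial mean-rating baseline as a stand-in for whatever model you actually choose (all names here are illustrative):

```python
import pickle
from statistics import mean

class MeanBaseline:
    """Trivial stand-in model: always predicts the mean training rating."""
    def fit(self, ratings):
        self.mean_ = mean(ratings)
        return self

    def predict(self, n):
        return [self.mean_] * n

def save_model(model, path, metadata=None):
    # Bundle model and metadata in one file, satisfying the
    # single-output-file requirement below.
    with open(path, "wb") as fh:
        pickle.dump({"model": model, "metadata": metadata or {}}, fh)

def load_model(path):
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

A real submission would swap in a proper regressor, but the bundle-and-pickle pattern for shipping the model file between tasks stays the same.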

4. EvaluateModel

Here you can get creative! Pick a good metric and show your communication and presentation skills. Load your model and evaluate it on a held-out part of the data set. This task should have a concrete outcome, e.g. a zip of plots or, even better, a whole report (check the pweave package).

You will most likely need the output of this task to tell the client whether the model is suited for their endeavour. This should include an assessment of the quality of the model, and also of the consequences of the errors that the model makes.
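As a sketch of client-friendly metrics: mean absolute error, plus a "within k points" hit rate that a non-statistical reader can interpret directly (the choice of metrics here is illustrative, not a requirement):

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute rating error, in points."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def within_k_points(y_true, y_pred, k=2):
    """Share of predictions within k rating points of the truth --
    easy to explain: 'we're within 2 points X% of the time'."""
    hits = sum(abs(t - p) <= k for t, p in zip(y_true, y_pred))
    return hits / len(y_true)
```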

Other requirements

  • Each task:
    • Needs to be callable via the command line
    • Needs to be documented
    • Should have a single file as output (if you have two, consider putting them into a single file or use a .SUCCESS flag file as the task's output)
  • Task images that aren't handled by docker-compose should be built and tagged in ./build-task-images.sh
  • Task images should be minimal to complete the task
  • The data produced by your tasks should be structured (directories and filenames) sensibly inside ./data_root
  • Don't commit anything in ./data_root, use .gitignore
  • Your code should conform to PEP8
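To illustrate the first requirement, a task's command-line entry point might look like the following argparse sketch (the flag names are hypothetical, not the scaffold's actual interface):

```python
import argparse

def parse_args(argv=None):
    """Parse the CLI arguments the orchestrator would pass to this task."""
    parser = argparse.ArgumentParser(description="Example task CLI")
    parser.add_argument("--in-csv", required=True,
                        help="Path to the input CSV")
    parser.add_argument("--out-dir", required=True,
                        help="Directory for the task's single output file")
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    print(f"Reading {args.in_csv}, writing output to {args.out_dir}")
```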

Get Started

To get started, execute the DownloadData task; we provide this task fully containerized for you. Let's first build the images. We have included a script to streamline this:

./build-task-images.sh 0.1

Now to execute the pipeline simply run:

docker-compose up orchestrator

This will download the data for you. It might be a good idea to execute:

watch -n 0.1 docker ps

in a different terminal window to get a sense of what is going on.

We recommend starting development in notebooks or your IDE locally if you're not very familiar with docker. This way we can consider your solution even if you don't get the whole pipeline running. Also, don't hesitate to contact us if you hit a serious blocker instead of wasting too much time on it.

NOTE: Configure your docker network

Docker runs containers in their own networks, and Compose automatically creates a network for each project. This project assumes that this network is named code-challenge-2020_default; depending on your folder name and Compose version, this might not always be the case. You will get an error when trying to download the data if this network is named differently for you. If you run into this error, please execute docker network ls and identify the correct network name. Then open docker-compose.yml and edit the env variable on the orchestrator service.

Troubleshooting in Task Containers

We also included a Debug task for you, which you may start if you need a shell inside a task's container. Make sure to adjust the image if you want to debug a task other than DownloadData. Then run:

docker-compose run orchestrator luigi --module task Debug --local-scheduler

This will spawn a task with luigi but set it to sleep for 3600 seconds. You can use that time to get a shell into the container. First you need to find the container's name, so from a different terminal run:

docker ps

Check for a container named debug-<something>, then execute:

docker exec -ti debug-<something> shell

Now you're in the container and can move around the filesystem, execute commands, etc. To exit, simply type exit.

Exposed Dashboards

This scaffold exposes 2 dashboards:

Evaluation Criteria

Your solution will be evaluated against the following criteria:

  • Is it runnable? (or does it at least look runnable; this shouldn't be binary) 15 points
  • ML Best Practices 15 points
  • Presentation of results (during interview) 20 points
  • Written report quality (before interview) 15 points
  • Code Quality (incl. Documentation and PEP8) 10 points
  • Structure/Modularity 10 points
  • Correct use of linux tools (dockerfiles, shellscripts) 10 points
  • Performance (concurrency, correct use of docker cache) 5 points

Task as Container TL;DR

This is a TL;DR of the TaC blogpost:

  • We spawn containers from an orchestrator container.
  • These spawned containers run pipeline steps.
  • Services that need to be accessed by the containers are built and managed via docker-compose.
  • We treat the orchestrator as a service.
  • To share data between containers, we must tell the orchestrator where our project is located on the host machine. The orchestrator will then mount this directory into /usr/share/data in dynamically spawned containers.
  • To allow the orchestrator to spawn containers, we must expose the host's docker socket to it.

FAQ

Can I use notebooks?

Yes, you are encouraged to use notebooks for ad-hoc analysis. Please include them in your submission. However, having a pipeline set up in a notebook does not free you from submitting a working containerized pipeline.

What is the recommended way to develop this?

Just install all the needed packages in a conda env or virtualenv and start developing in your favorite IDE, within the beloved jupyter notebook, or both. Once you are happy with the results, expose your notebook's functionality via a CLI and package it with a Dockerfile.

Can I use other technologies? Such as R, Spark, Pyspark, Modin, etc.

Yes, as long as you can provision the docker containers and spin up all the necessary services with docker-compose.

Do you accept partial submissions?

Yes, you can submit your coding challenge partially finished in case you don't finish in time or have trouble with all the docker stuff. Unfinished challenges will be reviewed if some kind of model evaluation report is included (notebook or similar). You will lose points though, as the submission will be considered not runnable (no points in the runnable category, no points in the linux tools category, and a maximum of 3 points in the performance category).

I found a bug! What should I do?

Please contact us! We wrote this in a hurry and we also make mistakes. PRs on bugs get you extra points ;)

I have another question!

Feel free to create an issue! Discussions in issues are generally encouraged.

Submission

The following artifacts/files are expected as deliverables:

  • Your solution, containing all files necessary to run it as a docker-compose project
  • A complete log of your local run of the complete pipeline
  • Your client-facing rendered report, such as an (executed) notebook, a rendered pweave report, or a PDF

Please zip your solution including all files and send it to us with the following naming schema:

cc20_<first_name>_<last_name>.zip

Contributors

haikane, kayibal, lusob, michcio1234, pedrocwb, pipatth, sienkowski, superlaza


code-challenge-2020's Issues

luigi: error: unrecognized arguments

In the file orchestrator/task.py there's a class (task) called "MakeDataset". This class declares luigi parameters with underscores in their names, for example out_dir = luigi.Parameter(). It throws an exception when someone tries to run the task using:
docker-compose run orchestrator luigi --module task MakeDataset --out_dir mydir --scheduler-host luigid

I've found the problem and solution here:
spotify/luigi#1728

Error Downloading dataset

In order to build the images and download the data, I executed ./build-task-images.sh 0.1, then I executed docker-compose up orchestrator, but I got these errors:

WARNING: The PWD variable is not set. Defaulting to a blank string.
Creating code-challenge-2019_luigid_1         ... done
Creating code-challenge-2019_dask-scheduler_1 ... done
Recreating code-challenge-2019_orchestrator_1 ... done
Attaching to code-challenge-2019_orchestrator_1
orchestrator_1    | DEBUG: Checking if DownloadData(no_remove_finished=False, fname=wine_dataset, out_dir=/usr/share/data/raw/, url=https://github.com/datarevenue-berlin/code-challenge-2019/releases/download/0.1.0/dataset_sampled.csv) is complete
orchestrator_1    | INFO: Informed scheduler that task   DownloadData_wine_dataset_False__usr_share_data__79bc385f2e   has status   PENDING
orchestrator_1    | INFO: Done scheduling tasks
orchestrator_1    | INFO: Running Worker with 1 processes
orchestrator_1    | DEBUG: Asking scheduler for work...
orchestrator_1    | DEBUG: Pending tasks: 1
orchestrator_1    | INFO: [pid 1] Worker Worker(salt=005178342, workers=1, host=49f018198416, username=root, pid=1) running   DownloadData(no_remove_finished=False, fname=wine_dataset, out_dir=/usr/share/data/raw/, url=https://github.com/datarevenue-berlin/code-challenge-2019/releases/download/0.1.0/dataset_sampled.csv)
orchestrator_1    | ERROR: [pid 1] Worker Worker(salt=005178342, workers=1, host=49f018198416, username=root, pid=1) failed    DownloadData(no_remove_finished=False, fname=wine_dataset, out_dir=/usr/share/data/raw/, url=https://github.com/datarevenue-berlin/code-challenge-2019/releases/download/0.1.0/dataset_sampled.csv)
orchestrator_1    | Traceback (most recent call last):
orchestrator_1    |   File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 261, in _raise_for_status
orchestrator_1    |     response.raise_for_status()
orchestrator_1    |   File "/usr/local/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
orchestrator_1    |     raise HTTPError(http_error_msg, response=self)
orchestrator_1    | requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.35/containers/5ddaa92a0628f808540bcc84316fcb811524fcc25d238cc199a0e707adb5989d/start
orchestrator_1    | 
orchestrator_1    | During handling of the above exception, another exception occurred:
orchestrator_1    | 
orchestrator_1    | Traceback (most recent call last):
orchestrator_1    |   File "/usr/local/lib/python3.6/site-packages/luigi/worker.py", line 199, in run
orchestrator_1    |     new_deps = self._run_get_new_deps()
orchestrator_1    |   File "/usr/local/lib/python3.6/site-packages/luigi/worker.py", line 141, in _run_get_new_deps
orchestrator_1    |     task_gen = self.task.run()
orchestrator_1    |   File "/opt/orchestrator/util.py", line 352, in run
orchestrator_1    |     self._run_and_track_task()
orchestrator_1    |   File "/opt/orchestrator/util.py", line 364, in _run_and_track_task
orchestrator_1    |     self.configuration,
orchestrator_1    |   File "/opt/orchestrator/util.py", line 195, in run_container
orchestrator_1    |     raise e
orchestrator_1    |   File "/opt/orchestrator/util.py", line 185, in run_container
orchestrator_1    |     **configuration)
orchestrator_1    |   File "/usr/local/lib/python3.6/site-packages/docker/models/containers.py", line 809, in run
orchestrator_1    |     container.start()
orchestrator_1    |   File "/usr/local/lib/python3.6/site-packages/docker/models/containers.py", line 400, in start
orchestrator_1    |     return self.client.api.start(self.id, **kwargs)
orchestrator_1    |   File "/usr/local/lib/python3.6/site-packages/docker/utils/decorators.py", line 19, in wrapped
orchestrator_1    |     return f(self, resource_id, *args, **kwargs)
orchestrator_1    |   File "/usr/local/lib/python3.6/site-packages/docker/api/container.py", line 1095, in start
orchestrator_1    |     self._raise_for_status(res)
orchestrator_1    |   File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 263, in _raise_for_status
orchestrator_1    |     raise create_api_error_from_http_exception(e)
orchestrator_1    |   File "/usr/local/lib/python3.6/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
orchestrator_1    |     raise cls(e, response=response, explanation=explanation)
orchestrator_1    | docker.errors.NotFound: 404 Client Error: Not Found ("network code_challenge_default not found")
orchestrator_1    | DEBUG: 1 running tasks, waiting for next task to finish
orchestrator_1    | INFO: Informed scheduler that task   DownloadData_wine_dataset_False__usr_share_data__79bc385f2e   has status   FAILED
orchestrator_1    | DEBUG: Asking scheduler for work...
orchestrator_1    | DEBUG: Done
orchestrator_1    | DEBUG: There are no more tasks to run at this time
orchestrator_1    | DEBUG: There are 1 pending tasks possibly being run by other workers
orchestrator_1    | DEBUG: There are 1 pending tasks unique to this worker
orchestrator_1    | DEBUG: There are 1 pending tasks last scheduled by this worker
orchestrator_1    | INFO: Worker Worker(salt=005178342, workers=1, host=49f018198416, username=root, pid=1) was stopped. Shutting down Keep-Alive thread
orchestrator_1    | INFO: 
orchestrator_1    | ===== Luigi Execution Summary =====
orchestrator_1    | 
orchestrator_1    | Scheduled 1 tasks of which:
orchestrator_1    | * 1 failed:
orchestrator_1    |     - 1 DownloadData(no_remove_finished=False, fname=wine_dataset, out_dir=/usr/share/data/raw/, url=https://github.com/datarevenue-berlin/code-challenge-2019/releases/download/0.1.0/dataset_sampled.csv)
orchestrator_1    | 
orchestrator_1    | This progress looks :( because there were failed tasks
orchestrator_1    | 
orchestrator_1    | ===== Luigi Execution Summary =====
orchestrator_1    | 
code-challenge-2019_orchestrator_1 exited with code 0 

Cannot download data with docker-compose

Dear all,

when running "sudo docker-compose up orchestrator", I get the output posted below but no file appears in
/usr/share/data/raw/ (in fact there is no data/ directory in /usr/share). There is no file in data_root/raw either.

More info:
OS: Linux Mint 19.3 Tricia (based on Ubuntu Bionic)
Docker version 19.03.12, build 48a66213fe
docker-compose version 1.27.1, build 509cfb99

Using "docker network ls", I can see the network named "code-challenge-2020_default".

--
Here is the command output

$ sudo docker-compose up orchestrator
WARNING: The PWD variable is not set. Defaulting to a blank string.
Creating network "code-challenge-2020_default" with the default driver
Creating code-challenge-2020_dask-scheduler_1 ... done
Creating code-challenge-2020_luigid_1 ... done
Creating code-challenge-2020_orchestrator_1 ... done
Attaching to code-challenge-2020_orchestrator_1
orchestrator_1 | DEBUG: Checking if DownloadData(no_remove_finished=False, fname=wine_dataset, out_dir=/usr/share/data/raw/, url=https://github.com/datarevenue-berlin/code-challenge-2019/releases/download/0.1.0/dataset_sampled.csv) is complete
orchestrator_1 | WARNING: Failed connecting to remote scheduler 'http://luigid:8082'
orchestrator_1 | Traceback (most recent call last):
orchestrator_1 | File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 160, in _new_conn
orchestrator_1 | (self._dns_host, self.port), self.timeout, **extra_kw
orchestrator_1 | File "/usr/local/lib/python3.6/site-packages/urllib3/util/connection.py", line 84, in create_connection
orchestrator_1 | raise err
orchestrator_1 | File "/usr/local/lib/python3.6/site-packages/urllib3/util/connection.py", line 74, in create_connection
orchestrator_1 | sock.connect(sa)
orchestrator_1 | ConnectionRefusedError: [Errno 111] Connection refused
orchestrator_1 |
orchestrator_1 | During handling of the above exception, another exception occurred:
orchestrator_1 |
orchestrator_1 | Traceback (most recent call last):
orchestrator_1 | File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 677, in urlopen
orchestrator_1 | chunked=chunked,
orchestrator_1 | File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 392, in _make_request
orchestrator_1 | conn.request(method, url, **httplib_request_kw)
orchestrator_1 | File "/usr/local/lib/python3.6/http/client.py", line 1287, in request
orchestrator_1 | self._send_request(method, url, body, headers, encode_chunked)
orchestrator_1 | File "/usr/local/lib/python3.6/http/client.py", line 1333, in _send_request
orchestrator_1 | self.endheaders(body, encode_chunked=encode_chunked)
orchestrator_1 | File "/usr/local/lib/python3.6/http/client.py", line 1282, in endheaders
orchestrator_1 | self._send_output(message_body, encode_chunked=encode_chunked)
orchestrator_1 | File "/usr/local/lib/python3.6/http/client.py", line 1042, in _send_output
orchestrator_1 | self.send(msg)
orchestrator_1 | File "/usr/local/lib/python3.6/http/client.py", line 980, in send
orchestrator_1 | self.connect()
orchestrator_1 | File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 187, in connect
orchestrator_1 | conn = self._new_conn()
orchestrator_1 | File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 172, in _new_conn
orchestrator_1 | self, "Failed to establish a new connection: %s" % e
orchestrator_1 | urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f1dbb74c320>: Failed to establish a new connection: [Errno 111] Connection refused
orchestrator_1 |
orchestrator_1 | During handling of the above exception, another exception occurred:
orchestrator_1 |
orchestrator_1 | Traceback (most recent call last):
orchestrator_1 | File "/usr/local/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
orchestrator_1 | timeout=timeout
orchestrator_1 | File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 727, in urlopen
orchestrator_1 | method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
orchestrator_1 | File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 439, in increment
orchestrator_1 | raise MaxRetryError(_pool, url, error or ResponseError(cause))
orchestrator_1 | urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='luigid', port=8082): Max retries exceeded with url: /api/add_task (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1dbb74c320>: Failed to establish a new connection: [Errno 111] Connection refused',))
orchestrator_1 |
orchestrator_1 | During handling of the above exception, another exception occurred:
orchestrator_1 |
orchestrator_1 | Traceback (most recent call last):
orchestrator_1 | File "/usr/local/lib/python3.6/site-packages/luigi/rpc.py", line 163, in _fetch
orchestrator_1 | response = self._fetcher.fetch(full_url, body, self._connect_timeout)
orchestrator_1 | File "/usr/local/lib/python3.6/site-packages/luigi/rpc.py", line 116, in fetch
orchestrator_1 | resp = self.session.post(full_url, data=body, timeout=timeout)
orchestrator_1 | File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 578, in post
orchestrator_1 | return self.request('POST', url, data=data, json=json, **kwargs)
orchestrator_1 | File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
orchestrator_1 | resp = self.send(prep, **send_kwargs)
orchestrator_1 | File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
orchestrator_1 | r = adapter.send(request, **kwargs)
orchestrator_1 | File "/usr/local/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
orchestrator_1 | raise ConnectionError(e, request=request)
orchestrator_1 | requests.exceptions.ConnectionError: HTTPConnectionPool(host='luigid', port=8082): Max retries exceeded with url: /api/add_task (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1dbb74c320>: Failed to establish a new connection: [Errno 111] Connection refused',))
orchestrator_1 | INFO: Retrying attempt 2 of 3 (max)
orchestrator_1 | INFO: Wait for 30 seconds
orchestrator_1 | INFO: Informed scheduler that task DownloadData_wine_dataset_False__usr_share_data__79bc385f2e has status PENDING
orchestrator_1 | INFO: Done scheduling tasks
orchestrator_1 | INFO: Running Worker with 1 processes
orchestrator_1 | DEBUG: Asking scheduler for work...
orchestrator_1 | DEBUG: Pending tasks: 1
orchestrator_1 | INFO: [pid 1] Worker Worker(salt=932162234, workers=1, host=793338ce4678, username=root, pid=1) running DownloadData(no_remove_finished=False, fname=wine_dataset, out_dir=/usr/share/data/raw/, url=https://github.com/datarevenue-berlin/code-challenge-2019/releases/download/0.1.0/dataset_sampled.csv)
orchestrator_1 | INFO: INFO:download-data:Downloading dataset
orchestrator_1 | INFO: INFO:download-data:Will write to /usr/share/data/raw/wine_dataset.csv
orchestrator_1 | INFO: [pid 1] Worker Worker(salt=932162234, workers=1, host=793338ce4678, username=root, pid=1) done DownloadData(no_remove_finished=False, fname=wine_dataset, out_dir=/usr/share/data/raw/, url=https://github.com/datarevenue-berlin/code-challenge-2019/releases/download/0.1.0/dataset_sampled.csv)
orchestrator_1 | DEBUG: 1 running tasks, waiting for next task to finish
orchestrator_1 | INFO: Informed scheduler that task DownloadData_wine_dataset_False__usr_share_data__79bc385f2e has status DONE
orchestrator_1 | DEBUG: Asking scheduler for work...
orchestrator_1 | DEBUG: Done
orchestrator_1 | DEBUG: There are no more tasks to run at this time
orchestrator_1 | INFO: Worker Worker(salt=932162234, workers=1, host=793338ce4678, username=root, pid=1) was stopped. Shutting down Keep-Alive thread
orchestrator_1 | INFO:
orchestrator_1 | ===== Luigi Execution Summary =====
orchestrator_1 |
orchestrator_1 | Scheduled 1 tasks of which:
orchestrator_1 | * 1 ran successfully:
orchestrator_1 | - 1 DownloadData(no_remove_finished=False, fname=wine_dataset, out_dir=/usr/share/data/raw/, url=https://github.com/datarevenue-berlin/code-challenge-2019/releases/download/0.1.0/dataset_sampled.csv)
orchestrator_1 |
orchestrator_1 | This progress looks :) because there were no failed tasks or missing dependencies
orchestrator_1 |
orchestrator_1 | ===== Luigi Execution Summary =====
orchestrator_1 |
code-challenge-2020_orchestrator_1 exited with code 0

Thank you

Partial Submission .zip Year Change

Very minor, but the naming schema instructions still say to name it cc19_<first_name>_<last_name>.zip instead of cc20_<first_name>_<last_name>.zip.
