duke-gcb / calrissian


CWL on Kubernetes

Home Page: https://duke-gcb.github.io/calrissian/

License: MIT License

Languages: Python 96.77%, Common Workflow Language 2.02%, Shell 0.91%, Dockerfile 0.29%
Topics: cwl, cwl-workflow, cwl-workflows, kubernetes

calrissian's Introduction

Calrissian

CWL on Kubernetes


Overview

Calrissian is a CWL implementation designed to run inside a Kubernetes cluster. Its goal is to be highly efficient and scalable, taking advantage of high capacity clusters to run many steps in parallel.

Cluster Requirements

Calrissian requires a Kubernetes or Openshift/OKD cluster, configured to provision PersistentVolumes with the ReadWriteMany access mode. Kubernetes installers and cloud providers don't usually include this type of storage, so it may require additional configuration.

Calrissian has been tested with NFS using the nfs-client-provisioner and with GlusterFS using OKD Containerized GlusterFS. Many cloud providers have an NFS offering, which integrates easily using the nfs-client-provisioner.
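For illustration only, a ReadWriteMany claim could be created with the Python Kubernetes client roughly as below; the storage class name "nfs-client" and the claim name are assumptions that depend on how your provisioner was installed, not part of Calrissian itself:

# Hypothetical sketch: create a ReadWriteMany PVC for Calrissian data.
# "nfs-client" and "calrissian-data" are assumed names, adjust for your cluster.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
core = client.CoreV1Api()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="calrissian-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],
        storage_class_name="nfs-client",
        resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
    ),
)
core.create_namespaced_persistent_volume_claim(namespace="calrissian", body=pvc)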

Scalability / Resource Requirements

Calrissian is designed to issue tasks in parallel if they are independent, and thanks to Kubernetes, should be able to run very large parallel workloads.

When running calrissian, you must provide limits on the number of CPU cores (--max-cores) and megabytes of RAM (--max-ram) to use concurrently. Calrissian will use CWL ResourceRequirements to track usage and stay within the limits provided. We highly recommend using accurate ResourceRequirements in your workloads, so that they can be scheduled efficiently and are less likely to be terminated or refused by the cluster.

Calrissian parameters can be provided via a JSON configuration file, either stored under ~/.calrissian/default.json or passed via the --conf option.

Below is an example of such a file:

{
    "max_ram": "16G",
    "max_cores": "10",
    "outdir": "/calrissian",
    "tmpdir_prefix": "/calrissian/tmp"
}

CWL Conformance

Calrissian leverages cwltool heavily and passes most conformance tests for CWL v1.0. Please see conformance for further details and processes.

To view open issues related to conformance, see the conformance label on the issue tracker.

Setup

Please see examples for installation and setup instructions.

Environment Variables

Calrissian's behaviors can be customized by setting the following environment variables in the container specification.

Pod lifecycle

By default, pods for a job step will be deleted after termination.

  • CALRISSIAN_DELETE_PODS: Default true. If false, job step pods will not be deleted.

Kubernetes API retries

When encountering a Kubernetes API exception, Calrissian uses a library to retry API calls with an exponential backoff. See the tenacity documentation for details.

  • RETRY_MULTIPLIER: Default 5. Multiplier (in seconds) applied to the exponential backoff interval.
  • RETRY_MIN: Default 5. Minimum interval (seconds) between retries.
  • RETRY_MAX: Default 1200. Maximum interval (seconds) between retries.
  • RETRY_ATTEMPTS: Default 10. Maximum number of retries before giving up.
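As a rough sketch of how these settings could map onto tenacity (the decorator arguments Calrissian actually uses may differ), the values translate to something like:

# Hedged sketch only: wiring RETRY_* style settings into a tenacity decorator.
import os
from kubernetes.client.rest import ApiException
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(ApiException),
    wait=wait_exponential(
        multiplier=int(os.environ.get("RETRY_MULTIPLIER", 5)),
        min=int(os.environ.get("RETRY_MIN", 5)),
        max=int(os.environ.get("RETRY_MAX", 1200)),
    ),
    stop=stop_after_attempt(int(os.environ.get("RETRY_ATTEMPTS", 10))),
)
def create_pod(core_api, namespace, pod_body):
    # Retried with exponential backoff whenever the Kubernetes API raises ApiException.
    return core_api.create_namespaced_pod(namespace=namespace, body=pod_body)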

calrissian's People

Contributors

dependabot[bot], dleehr, emmanuelmathot, fabricebrito, johnbradley, mr-c


calrissian's Issues

CWL 1.0 CT103: Test dockerOutputDirectory

Got workflow error
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/cwltool/executors.py", line 264, in runner
    job.run(runtime_context)
  File "/usr/local/lib/python3.6/site-packages/calrissian/job.py", line 553, in run
    self.check_requirements()
  File "/usr/local/lib/python3.6/site-packages/calrissian/job.py", line 390, in check_requirements
    raise UnsupportedRequirement('Error: feature {}.{} is not supported'.format(feature, field))
cwltool.errors.UnsupportedRequirement: Error: feature DockerRequirement.dockerOutputDirectory is not supported
Workflow or tool uses unsupported feature:
Error: feature DockerRequirement.dockerOutputDirectory is not supported

Templatize openshift/k8s files

Provide a mechanism for installing calrissian into a different project/namespace. We could conceivably use openshift templates here, helm, or a simple sed script.

On the cluster I was testing on, the calrissian namespace was already taken. I created calrissian2 but had to change the namespace in a bunch of places.

Originally posted by @johnbradley in #16

Support additional DockerRequirements where possible

This method assumes a DockerRequirement with dockerPull, and will fail otherwise:

def _get_container_image(self):
    # We only use dockerPull here.
    # Could possibly make an API call to kubernetes to check for the image there,
    # but that's not important right now.
    (docker_req, docker_is_req) = self.get_requirement("DockerRequirement")
    return str(docker_req["dockerPull"])

Related to #7, which would inject a DockerRequirement prior to this call
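One possible direction, sketched here only (the field precedence and error wording are assumptions, not the project's implementation), is to fall back to other DockerRequirement fields when dockerPull is absent:

from cwltool.errors import UnsupportedRequirement

def _get_container_image(self):
    # Sketch: prefer dockerPull, but fall back to dockerImageId when present.
    (docker_req, docker_is_req) = self.get_requirement("DockerRequirement")
    for field in ("dockerPull", "dockerImageId"):
        if docker_req and docker_req.get(field):
            return str(docker_req[field])
    raise UnsupportedRequirement(
        'Error: DockerRequirement needs dockerPull or dockerImageId')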

CWL 1.0 CT123: Test that expression engine does not fail to evaluate reference to self with unprovided input

Test [123/128] Test that expression engine does not fail to evaluate reference to self with unprovided input

Final process status is permanentFail
Test 123 failed: /usr/local/bin/calrissian --max-ram 8G --max-cores 4 --default-container debian:stretch-slim --outdir=/output/tmp4xzk7g0w --quiet v1.0/stage-unprovided-file.cwl v1.0/empty.json
Test that expression engine does not fail to evaluate reference to self with unprovided input
Returned non-zero

CWL 1.0 CT122: Test that boolean flags do not appear on command line if inputBinding is empty and not null

Test [122/128] Test that boolean flags do not appear on command line if inputBinding is empty and not null

Final process status is permanentFail
Test 122 failed: /usr/local/bin/calrissian --max-ram 8G --max-cores 4 --default-container debian:stretch-slim --outdir=/output/tmpipcpuvsd --quiet v1.0/bool-empty-inputbinding.cwl v1.0/bool-empty-inputbinding-job.json
Test that boolean flags do not appear on command line if inputBinding is empty and not null
Returned non-zero

Run CWL conformance tests

CWL provides a suite of conformance tests to demonstrate how an implementation conforms to the spec: https://github.com/common-workflow-language/cwltool/blob/master/cwltool/schemas/CONFORMANCE_TESTS.md

I've started to explore how to run these:

python -m venv venv
source venv/bin/activate
pip install calrissian cwltest
git clone git@github.com:common-workflow-language/cwltool.git
cd cwltool/cwltool/schemas
./run_test.sh RUNNER=calrissian -n1 --verbose --EXTRA="--max-ram 16G --max-cores 8"

The above attempts to run the first test, but fails:

$ ./run_test.sh RUNNER=calrissian -n1 --verbose --EXTRA="--max-ram 16G --max-cores 8"
./run_test.sh: line 69: eval: --: invalid option
eval: usage: eval [arg ...]
--- Running conformance test v1.0 on /Users/dcl9/Code/python/calrissian-conformance/venv/bin/calrissian ---
calrissian 0.5.0 (cwltool 1.0.20181217162649)
cwltest --tool /Users/dcl9/Code/python/calrissian-conformance/venv/bin/calrissian --test=conformance_test_v1.0.yaml -n1 --verbose --basedir /Users/dcl9/Code/python/calrissian-conformance/cwltool/cwltool/schemas/v1.0 --
Test [1/128] General test of command line generation
Test 1 failed: /Users/dcl9/Code/python/calrissian-conformance/venv/bin/calrissian --outdir=/var/folders/7t/q8bb2np92p5b26pk_h5lr0lh0000gn/T/tmp3sqc2xvb --quiet v1.0/bwa-mem-tool.cwl v1.0/bwa-mem-job.json
General test of command line generation
Returned non-zero

0 tests passed, 1 failures, 0 unsupported features

1 tool tests failed

Notes:

  • The first failure is due to the EXTRA option. I'm not sure if it's being used correctly.
  • Tests likely need to run inside a kubernetes cluster so that local path lookups are correct and data is available
  • Should look to other CI examples that run these conformance tests for how they are run automatically.

Error running CommandLineTool due to container name

Using calrissian from image dukegcb/calrissian:0.2.1 I received the following error:

Workflow error, try again with --debug for more information:
(422)
Reason: Unprocessable Entity
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-store', 'Content-Type': 
'application/json', 'Date': 'Thu, 28 Feb 2019 19:08:52 GMT', 'Content-Length': '888'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":
"Pod \"pair-wc-packed.cwl-pod-xftmzyvq\" is invalid: spec.containers[0].name: Invalid value: 
\"pair-wc-packed.cwl-container\": a DNS-1123 label must consist of lower case alphanumeric 
characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or
 '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')","reason":"Invalid",
"details":{"name":"pair-wc-packed.cwl-pod-xftmzyvq","kind":"Pod","causes":
[{"reason":"FieldValueInvalid","message":"Invalid value: \"pair-wc-packed.cwl-container\": 
a DNS-1123 label must consist of lower case alphanumeric characters or '-', and must start and 
end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for 
validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')","field":"spec.containers[0].name"}]},"code":422}

I was using this workflow: https://raw.githubusercontent.com/johnbradley/toyworkflow/master/pair-wc-packed.cwl

Upgrade PyYAML

The patched version of PyYAML has been released, but requirements.txt still specifies 3.13.

This requirement is a result of pip install kubernetes in 5f343ad, so we'll need to upgrade kubernetes, whenever they adopt PyYAML 5

Tracking of allocated resources may go negative

After a recent attempt to run a larger job with some failures, the allocation counts got to be negative:

run_job: Allocated 9/120.0 CPU cores, 38000/257698.03776 MB RAM
wait_for_next_completion: Allocated 9/120.0 CPU cores, 38000/257698.03776 MB RAM
[step map_13] completed success
shutil.rmtree(/tmp/skha43nv, True)
shutil.rmtree(/tmp/fl5b_1h3, True)
run_job: Allocated -3/120.0 CPU cores, 22000/257698.03776 MB RAM
wait_for_next_completion: Allocated -3/120.0 CPU cores, 22000/257698.03776 MB RAM
[step sort_11] completed success
shutil.rmtree(/tmp/kxgp070z, True)
shutil.rmtree(/tmp/swxj439s, True)
run_job: Allocated -4/120.0 CPU cores, 17000/257698.03776 MB RAM
wait_for_next_completion: Allocated -4/120.0 CPU cores, 17000/257698.03776 MB RAM
[step sort_15] completed success
shutil.rmtree(/tmp/d3iyf495, True)
shutil.rmtree(/tmp/sj80vchm, True)
run_job: Allocated -5/120.0 CPU cores, 12000/257698.03776 MB RAM
wait_for_next_completion: Allocated -5/120.0 CPU cores, 12000/257698.03776 MB RAM
[step sort_14] completed success
shutil.rmtree(/tmp/0l7a2ou2, True)
shutil.rmtree(/tmp/lgfgj73u, True)
run_job: Allocated -6/120.0 CPU cores, 7000/257698.03776 MB RAM
wait_for_next_completion: Allocated -6/120.0 CPU cores, 7000/257698.03776 MB RAM
[step sort] completed success
shutil.rmtree(/tmp/xxvilo3g, True)
shutil.rmtree(/tmp/ttdwcuqk, True)
run_job: Allocated -7/120.0 CPU cores, 2000/257698.03776 MB RAM
wait_for_next_completion: Allocated -7/120.0 CPU cores, 2000/257698.03776 MB RAM
[step sort_3] completed success
shutil.rmtree(/tmp/8r8f5oxi, True)
shutil.rmtree(/tmp/a8f491ya, True)
run_job: Allocated -8/120.0 CPU cores, -3000/257698.03776 MB RAM
wait_for_next_completion: Allocated -8/120.0 CPU cores, -3000/257698.03776 MB RAM
[step mark_duplicates_7] completed success
shutil.rmtree(/tmp/xl6e42es, True)
shutil.rmtree(/tmp/1b5jzgi5, True)
run_job: Allocated -9/120.0 CPU cores, -8000/257698.03776 MB RAM
wait_for_next_completion: Allocated -9/120.0 CPU cores, -8000/257698.03776 MB RAM
[step map_6] completed success
shutil.rmtree(/tmp/yzgwopzd, True)
shutil.rmtree(/tmp/fc6k6ybw, True)
run_job: Allocated -21/120.0 CPU cores, -24000/257698.03776 MB RAM
wait_for_next_completion: Allocated -21/120.0 CPU cores, -24000/257698.03776 MB RAM
[step map_23] completed success
shutil.rmtree(/tmp/4w6pygny, True)
shutil.rmtree(/tmp/ra3yze12, True)
run_job: Allocated -22/120.0 CPU cores, -29000/257698.03776 MB RAM
Workflow cannot make any more progress.
Final process status is permanentFail

Job hangs on ContainerCreating if volume specified twice

A single volume may be mounted twice, but listing the volume itself twice causes the container to hang.

Succeeds:

apiVersion: batch/v1
kind: Job
metadata:
  name: single-mount-twice
spec:
  template:
    spec:
      containers:
      - args: [date > /mount-double/double.txt]
        command: [/bin/sh, -c]
        image: debian:stretch-slim
        name: single-mount-twice
        volumeMounts:
        - {mountPath: /mount-single, name: mount-single, subPath: mount-single}
        - {mountPath: /mount-double, name: mount-single, subPath: mount-double}
      restartPolicy: Never
      volumes:
      - name: mount-single
        persistentVolumeClaim:
          claimName: multimount

Fails:

apiVersion: batch/v1
kind: Job
metadata:
  name: double-mount
spec:
  template:
    spec:
      containers:
      - args: [date > /mount-double/double.txt]
        command: [/bin/sh, -c]
        image: debian:stretch-slim
        name: double-mount
        volumeMounts:
        - {mountPath: /mount-single, name: mount-single, subPath: mount-single}
        - {mountPath: /mount-double, name: mount-double, subPath: mount-double}
      restartPolicy: Never
      volumes:
      - name: mount-single
        persistentVolumeClaim:
          claimName: multimount
      - name: mount-double
        persistentVolumeClaim:
          claimName: multimount

Add storage reporting to usage reporting

Calrissian will report CPU and memory usage. Storage is another important resource, and we have an issue Duke-GCB/lando#141 to better estimate these volumes.

Let's add a field to the reports that records the size of files generated at each step so that we can better size the intermediate (tmpout) volume, and perhaps /tmp too?
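A minimal sketch of the kind of measurement involved (not the project's reporting code): sum the on-disk size of a step's output directory.

import os

def directory_size_bytes(path):
    # Walk a step's output directory and sum file sizes; symlinks are skipped
    # so staged/shared inputs are not double counted (an assumption about intent).
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total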

Memory leaks

When running now with openshift metrics enabled, I noticed very high memory usage in calrissian. It seems to increase by several GB between running pods, and stays level during job execution.

(screenshot: OpenShift metrics graph of Calrissian pod memory usage, 2019-05-09)

Other notes:

Also the calrissian job pod is using 15GB. It started using 4GB, then doubled to 8GB, and now is using 16GB

- top says it's all under the calrissian process and not nodejs or other stuff
- currently using about 20% CPU; it might be checksumming, but slowly

schema-salad LockFailed when running under Docker

When running in Docker under openshift, the following error appears at the start of a calrissian run:

Could not load extension schema https://schema.org/docs/schema_org_rdfa.html: ('https://schema.org/docs/schema_org_rdfa.html', LockFailed('failed to create /.cache/salad/b/2/8/9/0/calrissian-gatk4-preprocessing-pkg-7rxg9-55dae700.13435938927651500254',))
Could not load extension schema https://schema.org/docs/schema_org_rdfa.html: ('https://schema.org/docs/schema_org_rdfa.html', LockFailed('failed to create /.cache/salad/b/2/8/9/0/calrissian-gatk4-preprocessing-pkg-7rxg9-55dae700.13435938927651500254',))

I suspect that schema-salad's home-dir lookup for caching is involved, and setting an environment variable on the pod would fix it.

Clean up pods when killed

Related to #38, but deleting a Calrissian job will not automatically delete the pods it submitted to the API.

We should clean these up when the process is killed. Suggestion: catch SIGTERM and delete any submitted pods.
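A minimal sketch of that suggestion, assuming the list of submitted pod names is tracked somewhere accessible to the handler (that bookkeeping is not shown here):

import signal
import sys

from kubernetes.client.rest import ApiException

def install_cleanup_handler(core_api, namespace, submitted_pod_names):
    # Sketch: on SIGTERM (e.g. the calrissian Job being deleted), delete every
    # pod calrissian has submitted, then exit.
    def handle_sigterm(signum, frame):
        for name in submitted_pod_names:
            try:
                core_api.delete_namespaced_pod(name=name, namespace=namespace)
            except ApiException:
                pass  # the pod may already be gone
        sys.exit(1)
    signal.signal(signal.SIGTERM, handle_sigterm)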

Changing the top-level job command to ['python','-m','calrissian.main'... fails to find nodejs

The current job specification has a command that runs run.sh in bash:

command: ["/bin/bash", "run.sh"]

And the CWL engine runs just fine.

I attempted to change this command in the YAML to remove the intermediate shell script:

command: ["python", "-m", "calrissian.main", "--tmpdir-prefix", "/calrissian/tmptmp/", "--tmp-outdir-prefix", "/calrissian/tmpout", "--outdir", "/calrissian/output-data", "/calrissian/input-data/revsort-array.cwl", "/calrissian/input-data/revsort-array-job.json"]

But the CWL process fails:

cwltool.sandboxjs.JavascriptException: cwltool requires Node.js engine to evaluate and validate Javascript expressions, but couldn't find it.  Tried nodejs, node, docker run node:slim

If I debug the failed container, node is present and available:

(screenshot of a debug shell in the failed container showing node on the PATH, 2019-01-14)

I think node is on the PATH when the command runs through a shell, but not when the container command invokes python directly. Something to watch out for.

Errors when not including trailing slash in input tmpdir params

--tmp-outdir-prefix

When --tmp-outdir-prefix /calrissian/tmpout is used the following error occurs:

...
Unhandled error, try again with --debug for more information:
[Errno 13] Permission denied: '/calrissian/tmpoutocot8g90'

I believe this should have been /calrissian/tmpout/ocot8g90

--tmpdir-prefix

When --tmpdir-prefix /calrissian/tmptmp is used the following error occurs:

Unexpected exception
Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/cwltool/workflow.py", line 793, in job
    runtimeContext):
  File "/opt/app-root/lib/python3.6/site-packages/cwltool/command_line_tool.py", line 500, in job
    tempfile.mkdtemp(prefix=tmpdir_prefix)  # type: ignore
  File "/opt/app-root/lib64/python3.6/tempfile.py", line 368, in mkdtemp
    _os.mkdir(file, 0o700)
PermissionError: [Errno 13] Permission denied: '/calrissian/tmptmp9z0_79if'
Cannot make scatter job: [Errno 13] Permission denied: '/calrissian/tmptmp9z0_79if'
Workflow cannot make any more progress.

I believe this should have been /calrissian/tmptmp/9z0_79if
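The underlying behavior is that the prefix is passed straight to tempfile.mkdtemp, which simply concatenates it with the random name. A possible defensive fix, sketched here and not the project's actual code, is to normalize prefixes so they always name a directory:

import os

def normalize_prefix(prefix):
    # Sketch: ensure a tmpdir prefix ends with a path separator so that
    # tempfile.mkdtemp(prefix=...) creates '/calrissian/tmpout/<random>'
    # instead of '/calrissian/tmpout<random>'.
    return prefix if prefix.endswith(os.sep) else prefix + os.sep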

CWL 1.0 CT68: Test that second expression in concatenated valueFrom is not ignored

Test [68/128] Test that second expression in concatenated valueFrom is not ignored

Test 68 failed: /usr/local/bin/calrissian --max-ram 8G --max-cores 4 --default-container debian:stretch-slim --outdir=/output/tmps26ft6ts --quiet v1.0/vf-concat.cwl v1.0/empty.json
Test that second expression in concatenated valueFrom is not ignored
Compare failure expected: {
    "out": "a string\n"
}
got: {
    "out": "\n"
}
caused by: failed comparison for key 'out': expected: "a string\n"
got: "\n"

Clean up logging and output from pods and calrissian process

Currently a JSON object is printed to stdout which contains details about the files created in the job.
This data can be seen in the logs, but it is mixed with stderr.
These two streams (stdout and stderr) should be persisted in some way outside of the job logs.

CWL 1.0 CT89: Test file literal as input

Test [89/128] Test file literal as input

Got workflow error
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/cwltool/executors.py", line 264, in runner
    job.run(runtime_context)
  File "/usr/local/lib/python3.6/site-packages/calrissian/job.py", line 554, in run
    pod = self.create_kubernetes_runtime(runtimeContext) # analogous to create_runtime()
  File "/usr/local/lib/python3.6/site-packages/calrissian/job.py", line 429, in create_kubernetes_runtime
    any_path_okay=True)
  File "/usr/local/lib/python3.6/site-packages/cwltool/job.py", line 575, in add_volumes
    runtime, vol, host_outdir_tgt, secret_store, tmpdir_prefix)
  File "/usr/local/lib/python3.6/site-packages/cwltool/job.py", line 537, in create_file_and_add_volume
    writable=writable)
  File "/usr/local/lib/python3.6/site-packages/calrissian/job.py", line 579, in append_volume
    raise NotImplementedError('append_volume')
NotImplementedError: append_volume
Workflow error, try again with --debug for more information:
append_volume
Test 89 failed: /usr/local/bin/calrissian --max-ram 8G --max-cores 4 --default-container debian:stretch-slim --outdir=/output/tmp6g8tvjsk --quiet v1.0/cat3-tool.cwl v1.0/file-literal.yml
Test file literal as input
Returned non-zero

Consider changing from creating Jobs to Pods

Calrissian creates a Kubernetes Job for each CWL CommandLineTool job to run. The KubernetesJobBuilder.build method builds the single-container definition.

However, to determine the CWL CommandLineTool's exit code, #21 makes a change to inspect each k8s Job's first Pod, and includes some code to check that the job had 1 pod and the pod had 1 container.

After implementing this, I'm not sure the k8s Job is what we should be submitting. Jobs are controllers in k8s and can have multiple replicas or cron behaviors. Additionally, k8s will restart a Job's pod(s), possibly on a different node, if they don't complete successfully. I don't think we need that layer of management, since calrissian is already watching/managing those details.

tl;dr: can we update calrissian to submit Pods instead of Jobs? It would probably simplify/streamline the code that's here.

Tests for add_writable_directory_volume

I wrote unit tests in #13 but skipped over this method. We don't plan to use this feature right now, and the code written is already tested from CWL. Return to this when we can test the actual feature rather than blindly writing a unit test in the mold of the existing code.

CWL 1.0 CT119: Test file literal as input without Docker

Test [119/128] Test file literal as input without Docker

Got workflow error
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/cwltool/executors.py", line 264, in runner
    job.run(runtime_context)
  File "/usr/local/lib/python3.6/site-packages/calrissian/job.py", line 554, in run
    pod = self.create_kubernetes_runtime(runtimeContext) # analogous to create_runtime()
  File "/usr/local/lib/python3.6/site-packages/calrissian/job.py", line 429, in create_kubernetes_runtime
    any_path_okay=True)
  File "/usr/local/lib/python3.6/site-packages/cwltool/job.py", line 575, in add_volumes
    runtime, vol, host_outdir_tgt, secret_store, tmpdir_prefix)
  File "/usr/local/lib/python3.6/site-packages/cwltool/job.py", line 537, in create_file_and_add_volume
    writable=writable)
  File "/usr/local/lib/python3.6/site-packages/calrissian/job.py", line 579, in append_volume
    raise NotImplementedError('append_volume')
NotImplementedError: append_volume
Workflow error, try again with --debug for more information:
append_volume
Test 119 failed: /usr/local/bin/calrissian --max-ram 8G --max-cores 4 --default-container debian:stretch-slim --outdir=/output/tmpb66gmonw --quiet v1.0/cat3-nodocker.cwl v1.0/file-literal.yml
Test file literal as input without Docker
Returned non-zero

Clean up example openshift yaml

To aid development, I created a DeploymentConfig to build this application from source and run it like a web app. Then, to try running a workflow, I oc rsh into the pod and invoke things there.

So this issue exists to remind us of this pattern and clean it up when we have something better.

Error running long workflow

Using the docker image dukegcb/calrissian:0.3.1 I received the following error when running the exomeseq-gatk4-preprocessing.cwl workflow.

Traceback (most recent call last):
  File "/usr/local/bin/calrissian", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/calrissian/main.py", line 83, inmain
    write_report(parsed_args.usage_report)
  File "/usr/local/lib/python3.6/site-packages/calrissian/report.py", line 343,in write_report
    json.dump(Reporter.get_report().to_dict(), f, indent=4, default=default_serializer)
  File "/usr/local/lib/python3.6/site-packages/calrissian/report.py", line 290,in to_dict
    result = super(TimelineReport, self).to_dict()
  File "/usr/local/lib/python3.6/site-packages/calrissian/report.py", line 39, in to_dict
    result['elapsed_hours'] = self.elapsed_hours()
  File "/usr/local/lib/python3.6/site-packages/calrissian/report.py", line 35, in elapsed_hours
    return self.elapsed_seconds() / SECONDS_PER_HOUR
  File "/usr/local/lib/python3.6/site-packages/calrissian/report.py", line 27, in elapsed_seconds
    delta = self.finish_time - self.start_time
TypeError: unsupported operand type(s) for -: 'NoneType' and 'NoneType'

Command was:

calrissian --default-container debian:stretch-slim --tmp-outdir-prefix /bespin/tmpout/ \
  --outdir /bespin/output-data/results/ --max-ram 120G --max-cores 52 \
  --usage-report /bespin/output-data/job-94-jpb67-resource-usage.json \
  /bespin/job-data/workflow/exomeseq-gatk4-preprocessing.cwl /bespin/job-data/job-order.json

Error when mounted volumes ending in slash

When an input data persistent volume is mounted with a trailing slash, files are incorrectly mounted into the pods created by calrissian.

This results in the child pods having an error:

'terminated': {'container_id': 'docker://<someid>',
'exit_code': 1,
...
'message': None,
'reason': 'Error',
...

Reproducing

If you change this line in CalrissianJob-revsort.yaml to end in a slash:

    - mountPath: /calrissian/input-data/

This line seems to be assuming there is an extra / after the mountPath:

source_without_prefix = source[len(prefix) + 1:]

This results in incorrect files being mounted for example hello1.txt:

volumeMounts:
...
- {mountPath: /var/lib/cwl/stge5759f63-f4d0-401b-9e06-d7efaf517e87/hello1.txt,
  name: calrissian-input-data, readOnly: true, subPath: ello1.txt}
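A more robust way to derive the sub-path, sketched here under the assumption that prefix is the volume's mount point, is to let os.path compute the relative part instead of slicing by length:

import os

def source_subpath(source, prefix):
    # Sketch: path of `source` relative to the volume mount `prefix`, tolerating
    # a trailing slash on the mount path. With prefix='/calrissian/input-data/'
    # and source='/calrissian/input-data/hello1.txt' this returns 'hello1.txt'
    # rather than 'ello1.txt'.
    return os.path.relpath(source, prefix)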

KubernetesClient.follow_logs() exits before getting all logs

follow_logs was added in #59, and calls read_namespaced_pod_log() with follow=True.

In my tests of the GATK4 preprocessing workflow, I can see that the generator returned by stream() finishes while the pod is still running.

logs from calrissian job:

$ oc logs --timestamps=True job/calrissian-gatk4-preprocessing |grep trim-pod
2019-03-14T18:52:27.643057398Z [trim-pod-klhihztw] 2019-03-14T18:52:28.377254055Z   >>> Now performing quality (cutoff 20) and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file /var/lib/cwl/stg5811d013-3646-40db-9308-efdf548c1776/SA05051-R2.fastq.gz <<<
2019-03-14T18:58:30.490922326Z [trim-pod-klhihztw] 2019-03-14T18:58:31.219549578Z 10000000 sequences processed
2019-03-14T19:04:36.760504315Z [trim-pod-klhihztw] 2019-03-14T19:04:37.48981895Z 20000000 sequences processed
2019-03-14T19:10:50.103854762Z [trim-pod-klhihztw] 2019-03-14T19:10:50.83159839Z 30000000 sequences processed
2019-03-14T19:17:27.807121329Z [trim-pod-klhihztw] 2019-03-14T19:17:28.540569252Z 40000000 sequences processed
2019-03-14T19:22:32.967763539Z [trim-pod-klhihztw] follow_logs end

logs directly from the pod

$ oc logs --timestamps trim-pod-klhihztw
2019-03-14T18:52:28.377254055Z   >>> Now performing quality (cutoff 20) and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file /var/lib/cwl/stg5811d013-3646-40db-9308-efdf548c1776/SA05051-R2.fastq.gz <<< 
2019-03-14T18:58:31.219549578Z 10000000 sequences processed
2019-03-14T19:04:37.48981895Z 20000000 sequences processed
2019-03-14T19:10:50.83159839Z 30000000 sequences processed
2019-03-14T19:17:28.540569252Z 40000000 sequences processed
2019-03-14T19:22:33.692899536Z This is cutadapt 1.14 with Python 3.5.2
2019-03-14T19:22:33.693016208Z Command line parameters: -f fastq -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC /var/lib/cwl/stg5811d013-3646-40db-9308-efdf548c1776/SA05051-R2.fastq.gz
2019-03-14T19:22:33.693031811Z Trimming 1 adapter with at most 10.0% errors in single-end mode 

The pod was running for about an hour when follow_logs exited:

2019-03-14T18:19:44.408873685Z k8s pod 'trim-pod-klhihztw' started
2019-03-14T18:19:49.311305246Z [trim-pod-klhihztw] follow_logs start
...
2019-03-14T19:22:32.967763539Z [trim-pod-klhihztw] follow_logs end
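One hedged way to work around this, sketched below and not necessarily how the project ultimately fixed it, is to keep re-opening the log stream until the pod reaches a terminal phase:

def follow_logs_until_done(core_api, name, namespace):
    # Sketch: the streamed log generator can end while the pod is still running,
    # so keep re-opening it until the pod phase is terminal. Re-opening replays
    # the log from the beginning; deduplication (or since_seconds) is left out.
    while True:
        resp = core_api.read_namespaced_pod_log(
            name=name, namespace=namespace, follow=True, _preload_content=False)
        for chunk in resp.stream():
            print(chunk.decode(errors="replace"), end="")
        phase = core_api.read_namespaced_pod(name=name, namespace=namespace).status.phase
        if phase in ("Succeeded", "Failed"):
            break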

Report resource usage by step and overall

Our job runner should be able to report how long each step ran for and what resources it used. This will be critical for accounting but could also be used for real-time progress reporting.

@johnbradley and I discussed recently how to accomplish this, and I suggested it would be best addressed by calrissian.

Some thoughts:

  • My initial idea was to address progress reporting and resource usage in the same mechanism - by watching for jobs issued that match a label or annotation.
  • lando watcher watches k8s events and translates them into rabbitmq messages. This would be useful for reporting real-time updates back to bespin-api through the existing queue.
  • Using lando watcher for that purpose is a stretch because lando and bespin aren't concerned with the steps of a CWL workflow or number of scattered tasks.
  • Calrissian (as the CWL engine) obviously is concerned with these progress details, so it makes sense to digest and broadcast this information from calrissian
  • Calrissian can easily track and report these times in a single JSON file at the end of the execution. This is a good feature for all users, not just bespin/lando integration
  • Reporting real-time progress is trickier, as this would likely need to move through the message queue.

child pods stuck creating when using read only gcePersistentDisk

When running within google cloud k8s engine the pods created by calrissian to run various steps get stuck in the ContainerCreating state. The Events for these pods show
AttachVolume.Attach failed for volume "<...gce pd name...> " : googleapi: Error 400: RESOURCE_IN_USE_BY_ANOTHER_RESOURCE - The disk resource '...gce pd name...' is already being used by '..k8s node..'

I think this might be fixed by declaring the persistentVolumeClaim as readOnly as described here:
kubernetes/kubernetes#67313 (comment)
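With the Python client, that suggestion would look roughly like the sketch below; the volume and claim names are placeholders, not Calrissian's actual identifiers:

from kubernetes import client

# Sketch: mount the claim read-only so the GCE disk can attach to multiple nodes.
input_volume = client.V1Volume(
    name="input-data",
    persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
        claim_name="calrissian-input-data",
        read_only=True,
    ),
)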

Clarify when --tmpdir-prefix is necessary

More context in #30

In #62, I changed the behavior for each container to get a local emptyDir volume mounted as /tmp. This means that the calrissian/cwltool generated tmpdir provided to the CalrissianCommandLineJob object usually goes unused and does not need to be located on a persistent volume. Prior to this change, a job's tmpdir would be a subdirectory inside of --tmpdir-prefix, so we required that that be on a persistent volume that could be mounted by the container.

After the change, that behavior no longer applies. The /tmp directory inside a container is local to that container, and should improve performance (since /tmp is no longer a network share)

However, there are still cases that would require a --tmpdir-prefix to be provided to calrissian. Specifically, tools that use the InitialWorkDirRequirement with writable files or directories. This feature is implemented in calrissian by making a copy of the original required file into the job's tmpdir (from the calrissian python process). So, that location must be a place that calrissian can write to and the container can modify. I assume tmpdir was chosen since this writable file is temporary and isolated:

If true, the file or directory must be writable by the tool. Changes to the file or directory must be isolated and not visible by any other CommandLineTool process. This may be implemented by making a copy of the original file or directory.

Correctly check job result in finish method

Currently the finish method always reports 'success':

def finish(self):
    # TODO: Check the results for real and clean up.
    status = 'success'
    # collect_outputs (and collect_output) is defined in command_line_tool
    outputs = self.collect_outputs(self.outdir)
    self.output_callback(outputs, status)

This should actually inspect the result of the kubernetes job and report failure
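A hedged sketch of that direction, assuming the terminated container's exit code is made available to the method (how it gets there is not shown, and successCodes handling is omitted):

def finish(self, exit_code):
    # Sketch: map the container exit code onto CWL process status instead of
    # always reporting success.
    status = 'success' if exit_code == 0 else 'permanentFail'
    outputs = self.collect_outputs(self.outdir)
    self.output_callback(outputs, status)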

Support multiple tmpout volumes

Calrissian uses a tmpout volume to store intermediate results between each step of a workflow.
Right now we are using a ReadWriteMany volume to share data between the pods running each step.
Look into creating multiple volumes with ReadWriteOnce instead of a single large ReadWriteMany to improve performance.

maintain versions of calrissian

We should be able to re-run workflows with the same version of calrissian.
To support multiple k8s clouds we will need a central location for storing calrissian images, for example Docker Hub.

Improve quote handling in command-line

In attempts to run exomeseq-gatk4-preprocessing, the map step fails because the command is not constructed correctly.

Relevant snippet from generated spec:

  containers:
    - args:
        - >-
          bash bwa-mem-samtools.sh -R
          @RG\tID:SA05051\tLB:F012354\tPL:Illumina\tPU:SA05051\tSM:SA05051 -t 20
          /var/lib/cwl/stgc9814adc-1982-403b-ade3-e106d7a82960/human_g1k_v37_decoy.fasta
          /var/lib/cwl/stgec2e16b0-4425-4dcc-8baa-dd9b5bef0451/SA05051-R1_val_1.fq.gz
          /var/lib/cwl/stg4f3a9d6b-7759-40fb-8070-ffecf1403d45/SA05051-R2_val_2.fq.gz
          > SA05051-mapped.bam
      command:
        - /bin/sh
        - '-c'
      env:
        - name: HOME
          value: /plpTRq
        - name: TMPDIR
          value: /tmp
      image: 'broadinstitute/genomes-in-the-cloud:2.3.1-1512499786'

I believe that the base cwltool job would catch this with

https://github.com/common-workflow-language/cwltool/blob/bbe20f54deea92d9c9cd38cb1f23c4423133d3de/cwltool/job.py#L243

but this feature is still outstanding in #8
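A minimal sketch of the kind of quoting needed when the whole command is handed to /bin/sh -c (shlex here stands in for cwltool's own shell-escaping, which is what #8 would bring in):

import shlex

def shell_command(argv, stdout_path=None):
    # Sketch: quote each argument so strings like '@RG\tID:...' survive the shell,
    # then append output redirection if the tool declared one.
    cmd = " ".join(shlex.quote(arg) for arg in argv)
    if stdout_path:
        cmd += " > " + shlex.quote(stdout_path)
    return cmd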

Dynamically connect/detect persistent volume paths and claims

Currently, the job builder assumes 4 specific PVCs and their mount points

def populate_demo_values(self):
    # TODO: fetch these from the kubernetes API since they are attached to this pod
    self.add_persistent_volume_entry('/calrissian/input-data', 'calrissian-input-data')
    self.add_persistent_volume_entry('/calrissian/output-data', 'calrissian-output-data')
    self.add_persistent_volume_entry('/calrissian/tmptmp', 'calrissian-tmp')
    self.add_persistent_volume_entry('/calrissian/tmpout', 'calrissian-tmpout')

These could either be dynamically looked up through the k8s API, or provided in a ConfigMap

Not terribly urgent if we continue the convention, but it is a kludge.
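One hedged sketch of the first option, reading the calling pod's own spec to discover PVC-backed mounts; using HOSTNAME as the pod name is an assumption, and this is not the project's code:

import os
from kubernetes import client, config

def discover_persistent_volume_entries(namespace):
    # Sketch: list (mount_path, claim_name) pairs for every PVC-backed volume
    # mounted into the pod this code runs in.
    config.load_incluster_config()
    core = client.CoreV1Api()
    pod = core.read_namespaced_pod(name=os.environ["HOSTNAME"], namespace=namespace)
    claims = {
        v.name: v.persistent_volume_claim.claim_name
        for v in pod.spec.volumes or []
        if v.persistent_volume_claim
    }
    entries = []
    for container in pod.spec.containers:
        for mount in container.volume_mounts or []:
            if mount.name in claims:
                entries.append((mount.mount_path, claims[mount.name]))
    return entries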

Implement resource allocation in executor (and jobs)

The current executor is a thin subclass of MultithreadedJobExecutor, so it schedules and accounts for jobs based on the local machine's CPU and RAM.

Obviously, the local machine's resources should not be used for jobs in a cluster.

Additionally, the jobs created should be able to include the requested resources for accurate scheduling.

Document how to run under minishift

Detail how calrissian can be used with minishift.
Two issues that I am aware of are built image URL differences and a persistent volume limitation.
