datashim-io / datashim

A Kubernetes-based framework for hassle-free handling of datasets

Home Page: http://datashim-io.github.io/datashim

License: Apache License 2.0

Makefile 8.51% Dockerfile 2.43% Shell 6.57% Go 81.61% Smarty 0.12% Python 0.75%
csi dataset-lifecycle-framework kubernetes nfs noobaa s3

datashim's People

Contributors

alessandropomponio, anthonyhaussman, captainpatate, chazapis, christian-pinto, dependabot[bot], gdubya, imgbot[bot], imgbotapp, kimxogus, malvag, martialblog, muandane, olevski, pkoutsov, seb-835, srikumar003, stevemar, tomcli, vassilisvassiliadis, viktoriaas, yiannisgkoufas


datashim's Issues

Fix goofys disconnection in case of error

Hi,
goofys needs syslog when a fatal event happens. Installing the package netcat-openbsd and running nc -k -l -U /var/run/syslog & fixes the issue, and the endpoint will not disconnect. I think the image quay.io/k8scsi/csi-node-driver-registrar:v1.2.0 needs the fix. (source)

Use datashim for helm with terraform

Hi,

I'd like to integrate your awesome project into my Terraform script, using Helm. I'm kind of a beginner with Helm, so I was wondering if you could explain how to add the Datashim charts as a repo. As far as I understand, it requires an index.yaml, which I cannot find in the charts.
I could install it with kubectl and the YAML file, but I'd like to exclude the EFS driver, as I do not need it and don't want it to waste resources.

If you intentionally didn't add an index.yaml, could you please point me in the right direction for how to handle this? Should I create my own index.yaml? Thanks a lot in advance!

Bug on dataset operator eviction

If the dataset operator pod gets evicted for some reason a new instance is started. However, the evicted instance seems to be holding a lock causing a deadlock.

In the new operator instance logs you can only see the following line repeated indefinitely:
{"level":"info","ts":1589796411.546784,"logger":"leader","msg":"Not the leader. Waiting."}
The pod is not capable of continuing its execution.

This seems to be an issue with the operator-framework that is being triaged at the below link:
operator-framework/operator-sdk#1305

error 400 Bad Request

Hi all,

I have a Kubernetes cluster on AWS (EKS).
We are currently using a workaround (a script run at node init) to be able to mount an S3 bucket on a pod.

I tried to use datashim which looks very promising.
I installed the setup with https://raw.githubusercontent.com/IBM/dataset-lifecycle-framework/master/release-tools/manifests/dlf.yaml

Here is my dataset config:

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: archive-dataset
spec:
  local:
    type: "COS"
    accessKeyID: "XXX"
    secretAccessKey: "XXX"
    endpoint: "https://s3.amazonaws.com"
    bucket: "bucket_name-ap-east-1"
    region: "ap-east-1" 

But I end up with the error:

  Warning  ProvisioningFailed    3m11s (x9 over 7m32s)  ch.ctrox.csi.s3-driver_csi-provisioner-s3-0_0ef0ce8b-2b1e-4a4e-8ebd-a82731d7ae1b  failed to provision volume with StorageClass "csi-s3": rpc error: code = Unknown desc = failed to check if bucket bucket_name-ap-east-1 exists: 400 Bad Request

All the pods in the dlf namespace are running fine (Running status; I haven't dug into the logs yet).
I tried with different credentials.
I can successfully mount the bucket locally with s3fs or goofys (with the same credentials).

Did I miss anything?
Thank you very much for your work.

Can Dataset support accessKey and secretAccessKey in Secret?

Can Dataset support a preexisting Secret where accessKeyID and secretAccessKey are stored? There are two reasons:

  1. End users do not want to store such information in the Dataset if they open-source their project. It can be a security risk.
  2. In larger organizations, such information may not be available to programmers. Administrators create Kubernetes Secrets on their behalf.

Relatedly, the rest of the information (endpoint, bucket, region) may be available in a ConfigMap. Please consider that as secondary.
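For illustration, a minimal sketch of what this could look like, assuming the secret-name/secret-namespace fields used in the "Dataset is stuck in Pending state" issue below also work as credential references here (field names and semantics would need to be confirmed):

apiVersion: v1
kind: Secret
metadata:
  name: bucket-creds
  namespace: default
stringData:
  accessKeyID: "XXX"
  secretAccessKey: "XXX"
---
apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: archive-dataset
spec:
  local:
    type: "COS"
    secret-name: "bucket-creds"        # assumed: credentials looked up from the Secret above
    secret-namespace: "default"
    endpoint: "https://s3.amazonaws.com"
    bucket: "bucket-name"
    region: "ap-east-1"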

Dataset Operator permission mismatch

Hi,

I have a cluster with pod security policy enabled. When I try to deploy the operator, I always get Error: container has runAsNonRoot and image will run as root in the dataset-operator Deployment.

The issue is resolved by adding a security context under spec.template.spec. I used

spec:
  securityContext:
    runAsUser: 1000

and the pod starts now.

Could a similar fix be added to the code?
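For reference, a minimal sketch of where such a security context would sit in the operator Deployment (the UID is only an example; any non-root UID satisfies runAsNonRoot):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dataset-operator
spec:
  template:
    spec:
      securityContext:
        runAsUser: 1000                # example non-root UID
      containers:
        - name: dataset-operator
          image: dataset-operator      # unchanged from the original manifest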

Dataset using NFS cannot be attached to pods

kubectl logs csi-attacher-nfsplugin-0 -c csi-attacher on the cluster showed that the volume could not be attached as there was no patch permission for csi-attacher-nfs-plugin
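A hedged sketch of the kind of RBAC addition that would fix this, assuming the attacher's ClusterRole is simply missing the patch verb (the role name and exact rules would need to match the shipped manifest):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: csi-attacher-nfs-plugin        # assumed name; use the role bound to the NFS attacher
rules:
  - apiGroups: ["storage.k8s.io"]
    resources: ["volumeattachments"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["volumeattachments/status"]
    verbs: ["patch"]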

Installation of datashim results in an error

I'm getting the following error while deploying datashim:
error: unable to recognize "https://raw.githubusercontent.com/datashim-io/datashim/master/release-tools/manifests/dlf.yaml": no matches for kind "CSIDriver" in version "storage.k8s.io/v1"
Earlier installations were successful. I suspect that #105 is the cause.

ARCHIVE type has issues after failing to push datasets to S3

When we try to push a big dataset using the ARCHIVE type, it sometimes fails due to the large workload. After that, none of the other datasets created by the same DLF cluster can be mounted on any pod. Redeploying DLF and Minio does not solve the issue.

Here are the events after the pod fails to mount DLF's PVC:

18m         Warning   ProvisioningFailed      persistentvolumeclaim/example-dataset                       failed to provision volume with StorageClass "csi-s3": rpc error: code = DeadlineExceeded desc = context deadline exceeded
2s          Warning   FailedMount             pod/nginx                                                   Unable to attach or mount volumes: unmounted volumes=[example-dataset], unattached volumes=[example-dataset default-token-7qlxh]: timed out waiting for the condition
3m11s       Warning   VolumeFailedDelete      persistentvolume/pvc-6f6e9892-7fdf-4d8f-b1a2-c75d416c9b97   rpc error: code = Unknown desc = failed to initialize S3 client: Endpoint:  does not follow ip address or domain name standards.

Here is the dataset that can ruin the whole DLF cluster:

cat <<EOF | kubectl apply -f -
apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: example-dataset
  namespace: default
spec:
  type: "ARCHIVE"
  url: "https://dax-cdn.cdn.appdomain.cloud/dax-oil-reservoir-simulations/1.0.0/oil-reservoir-simulations.tar.gz"
  format: "application/x-tar"
EOF

Dataset for the 1000 Genomes project

I am trying to mount a dataset for the 1000 Genomes project - https://registry.opendata.aws/1000-genomes/
I have created the dataset object using:

---
apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: 1000-genome-dataset
spec:
  local:
    type: "COS"
    accessKeyID: ""
    secretAccessKey: ""
    endpoint: "https://s3-us-east-1.amazonaws.com"
    bucket: "1000genomes"
    readonly: "true" #OPTIONAL, default is false

The PVC for the dataset gets provisioned, but when trying to mount it into a pod, I get errors like this:

Warning  FailedMount  92s   kubelet            MountVolume.SetUp failed for volume "pvc-a785f139-e992-4790-a2f7-57ad1efa5476" : rpc error: code = Unknown desc = Error fuseMount command: goofys
args: [--endpoint=https://s3-us-east-1.amazonaws.com --profile=pvc-a785f139-e992-4790-a2f7-57ad1efa5476 --type-cache-ttl 1s --stat-cache-ttl 1s --dir-mode 0777 --file-mode 0777 --http-timeout 5m -o allow_other -o ro 1000genomes /var/lib/kubelet/pods/256d5c70-5751-4aaa-8095-fc951047a3db/volumes/kubernetes.io~csi/pvc-a785f139-e992-4790-a2f7-57ad1efa5476/mount]
output: 2021/05/03 22:35:32.659638 main.FATAL Unable to mount file system, see syslog for details

Any pointers would be greatly appreciated.

design document

Could you please share the design document, to understand how it mounts the bucket/NFS share?

Create new directory in NFS for each Dataset deployment

Hi,
today I tried to deploy both S3 and NFS Datasets in our environment and they work flawlessly.
However, I found out that NFS doesn't set up a new directory for each deployment but uses the same one for all.
Previously we were using nfs-client-provisioner (a Helm chart), but it is deprecated now. With it, you configured the NFS path and server and it created a new directory for each PVC (under the configured NFS path).
This behaviour is very handy: when you don't know in advance what you need, the deployment creates it for you and you don't have to worry about creating a new path for each Pod.

Could this be supported?

Dataset is stuck in Pending state

Scenario: I created a Dataset resource named "kind-example-v0.2-try6-cp4d3f6d318f7c".

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: kind-example-v0.2-try6-cp4d3f6d318f7c
spec:
  local:
    type: "COS"
    secret-name: "bucket-creds"
    secret-namespace: "m4d-system"
    endpoint: "http://s3.eu.cloud-object-storage.appdomain.cloud"
    provision: "true"
    bucket: "kind-example-v0.2-try6-cp4d3f6d318f7c"

Problem: The bucket has been successfully created. However, the Dataset status is stuck on "Pending".
Reason:
This is caused by a failure to reconcile the PVC resource. From the csi-provisioner-s3-0 log in the dlf namespace:
volume_store.go:144] error saving volume pvc-80abdaf5-bb2e-4f3b-a733-4ea96c0f1552: PersistentVolume "pvc-80abdaf5-bb2e-4f3b-a733-4ea96c0f1552" is invalid: spec.csi.name: Invalid value: "kind-example-v0.2-try6-cp4d3f6d318f7c": a DNS-1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'
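For illustration, the dot in "v0.2" is what violates the DNS-1123 label regex; a name built only from lower-case alphanumerics and hyphens should provision, e.g.:

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: kind-example-v0-2-try6-cp4d3f6d318f7c   # dot replaced with a hyphen
spec:
  local:
    type: "COS"
    secret-name: "bucket-creds"
    secret-namespace: "m4d-system"
    endpoint: "http://s3.eu.cloud-object-storage.appdomain.cloud"
    provision: "true"
    bucket: "kind-example-v0-2-try6-cp4d3f6d318f7c"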

Node-driver-registrar not working after reboot

Hi,
I would like to know whether the node-driver-registrar container in the daemonset csi-nodeplugin-nfsplugin is expected to fail and not restart after a node problem. If a node reboots, the node-driver-registrar container stops working with this log:

I0202 01:11:30.758475       1 node_register.go:58] Starting Registration Server at: /registration/nfs.csi.k8s.io-reg.sock
I0202 01:11:30.758743       1 node_register.go:67] Registration Server started at: /registration/nfs.csi.k8s.io-reg.sock
I0202 01:11:31.425742       1 main.go:77] Received GetInfo call: &InfoRequest{}
I0202 01:11:54.363628       1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
E0202 14:07:00.314932       1 connection.go:129] Lost connection to unix:///plugin/csi.sock.

I found this out when I created a job which had to mount a PVC with the csi-nfs storage class, and it did not get scheduled on the node that was rebooted yesterday. Logs from the job:

Warning  FailedMount  51s                  kubelet            Unable to attach or mount volumes: unmounted volumes=[dest-volume], unattached volumes=[dest-volume default-token-9btxt]: timed out waiting for the condition
  Warning  FailedMount  45s (x9 over 2m53s)  kubelet            MountVolume.MountDevice failed for volume "pvc-bd3d6316-0342-45f0-981d-0cdc9ca165c3" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name nfs.csi.k8s.io not found in the list of registered CSI drivers

Shouldn't something periodically ensure that the daemon is alive? I can create a cronjob or something for my cluster, but I thought I'd ask first.
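One way to have the kubelet restart a wedged registrar automatically would be a liveness probe on the node-driver-registrar container. This is only a sketch and assumes a registrar version that can serve a health endpoint via --health-port; the registration path is also assumed, so check the flags supported by the deployed image before relying on it:

      containers:
        - name: node-driver-registrar
          image: quay.io/k8scsi/csi-node-driver-registrar:v1.2.0
          args:
            - --csi-address=/plugin/csi.sock
            - --kubelet-registration-path=/var/lib/kubelet/plugins/nfs.csi.k8s.io/csi.sock   # assumed path
            - --health-port=9809         # assumed flag; only in registrar versions that expose /healthz
          livenessProbe:
            httpGet:
              path: /healthz
              port: 9809
            initialDelaySeconds: 10
            periodSeconds: 60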

Support H3 as an additional dataset type

H3 is an embedded high-speed, high-volume, and high-availability object store, backed by a high-performance key-value store (RocksDB, Redis, etc.). H3 also provides a FUSE implementation to allow object access using file semantics. The CSI H3 mount plugin (csi-h3 for short) allows you to use H3 FUSE to implement persistent volumes in Kubernetes.

In practice, csi-h3 implements a fast and efficient filesystem on top of a key-value store. With csi-h3 deployed and a Redis server running, you just need to specify the Redis endpoint and the bucket name you want to use in order to get a mountpoint for your containers. H3 is embedded in csi-h3, so there is nothing else to install.

H3 could be supported in DLF with a dataset definition like the following:

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: example-dataset
spec:
  local:
    type: "H3"
    storageUri: "redis://redis.default.svc:6379"
    bucket: "b1"

Note that H3 supports many additional key-value stores, but in the distributed environment of Kubernetes, you need a key-value store that can be accessed through a network protocol. For persistent storage, Ardb provides Redis connectivity over a range of key-value implementations, including RocksDB, LevelDB, and others. In that case, the storageUri used will still be in the form redis://..., but the actual service will be provided by Ardb.

Use base64-encoded secrets for dataset configuration

I was trying to configure an S3 dataset with a separate Secret definition and realized that Datashim only works when the secret values are given in stringData. Since kubectl create secret produces base64-encoded values under data, it would be more convenient to allow both formats.
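For illustration, the two Secret forms side by side (the data values below are the base64 encodings of the strings in the stringData variant; names and values are placeholders):

# Works today: plain values under stringData
apiVersion: v1
kind: Secret
metadata:
  name: s3-creds
stringData:
  accessKeyID: "AKIAEXAMPLE"
  secretAccessKey: "secret"
---
# What kubectl create secret produces: base64-encoded values under data
apiVersion: v1
kind: Secret
metadata:
  name: s3-creds
data:
  accessKeyID: QUtJQUVYQU1QTEU=
  secretAccessKey: c2VjcmV0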

NooBaa install on Mac OS X fails

$ make minikube-install
results in

Installing NooBaa...done
Building NooBaa data loader...done
Creating test OBC...error: the server doesn't have a resource type "obc"
error: the server doesn't have a resource type "obc"
error: the server doesn't have a resource type "obc"
error: the server doesn't have a resource type "obc"
error: the server doesn't have a resource type "obc"
error: the server doesn't have a resource type "obc"
error: the server doesn't have a resource type "obc"
error: the server doesn't have a resource type "obc"
error: the server doesn't have a resource type "obc"

This is happening because of this line in examples/noobaa/noobaa_install.sh, which fetches the Linux binary regardless of the host OS:
wget -P ${DIR} https://github.com/noobaa/noobaa-operator/releases/download/v2.0.10/noobaa-linux-v2.0.10 > /dev/null 2>&1

Errors during `make minikube-install`

Running make minikube-install mostly seems to work, but between loading the images into minikube and applying the YAML, it spits out these errors:

/bin/bash: ./release-tools/generate-keys.sh: No such file or directory
/bin/bash: line 1: /tmp/tmp.w89e0ut2NV/ca.crt: No such file or directory
W0921 17:11:40.845263  162519 helpers.go:535] --dry-run is deprecated and can be replaced with --dry-run=client.
error: Cannot read file /tmp/tmp.w89e0ut2NV/webhook-server-tls.crt, open /tmp/tmp.w89e0ut2NV/webhook-server-tls.crt: no such file or directory
error: no objects passed to apply
/bin/bash: line 5: ./src/dataset-operator/deploy/webhook.yaml.template: No such file or directory
error: no objects passed to apply

It seems like the generate-keys.sh stuff happens in-cluster now, so maybe this is nothing to worry about? It's a bit disconcerting though :-)

Update deprecated apiextensions.k8s.io/v1beta1 and admissionregistration.k8s.io/v1beta1

The following apiVersions will be deprecated in v1.22 and are used in the dataset-operator:

  • admissionregistration.k8s.io/v1beta1 => admissionregistration.k8s.io/v1
  • apiextensions.k8s.io/v1beta1 => apiextensions.k8s.io/v1

I have done some manual tests; upgrading the admissionregistration apiVersion from

apiVersion: admissionregistration.k8s.io/v1beta1

is easy and works well. 👌

Anyway, the problem I am blocked on is the migration of apiextensions.k8s.io/v1beta1.

Following these recommendations, I have tried to update the dataset-operator CRD files and successfully deployed them.
Here is what I have for the file src/dataset-operator/chart/templates/crds/com.ie.ibm.hpsys_datasets_crd.yaml:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: datasets.com.ie.ibm.hpsys
spec:
  group: com.ie.ibm.hpsys
  names:
    kind: Dataset
    listKind: DatasetList
    plural: datasets
    singular: dataset
  scope: Namespaced
  versions:
  - name: v1alpha1
    subresources:
      status: {}
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        description: Dataset is the Schema for the datasets API
        properties:
          apiVersion:
            description: 'APIVersion defines the versioned schema of this representation
              of an object. Servers should convert recognized schemas to the latest
              internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
            type: string
          kind:
            description: 'Kind is a string value representing the REST resource this
              object represents. Servers may infer this from the endpoint the client
              submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
            type: string
          metadata:
            type: object
          spec:
            description: DatasetSpec defines the desired state of Dataset
            properties:
              local:
                additionalProperties:
                  type: string
                description: 'INSERT ADDITIONAL SPEC FIELDS - desired state of cluster
                  Important: Run "operator-sdk generate k8s" to regenerate code after
                  modifying this file Add custom validation using kubebuilder tags:
                  https://book-v1.book.kubebuilder.io/beyond_basics/generating_crd.html
                  Conf map[string]string `json:"conf,omitempty"`'
                type: object
              remote:
                additionalProperties:
                  type: string
                type: object
            type: object
          status:
            description: DatasetStatus defines the observed state of Dataset
            properties:
              error:
                description: 'INSERT ADDITIONAL STATUS FIELD - define observed state
                  of cluster Important: Run "operator-sdk generate k8s" to regenerate
                  code after modifying this file Add custom validation using kubebuilder
                  tags: https://book-v1.book.kubebuilder.io/beyond_basics/generating_crd.html'
                type: string
            type: object

But when I try to deploy a new simple S3 Dataset:

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: test
spec:
  local:
    type: "COS"
    accessKeyID: "KeyID"
    secretAccessKey: "Secret"
    endpoint: "https://s3.eu-west-1.amazonaws.com"
    bucket: "test-bucket"
    readonly: "true" #OPTIONAL, default is false  

The controller sees the new resource:

...
{"level":"info","ts":1622124637.1511607,"logger":"controller_dataset","msg":"Reconciling Dataset","Request.Namespace":"default","Request.Name":"test"}
{"level":"info","ts":1622124637.1608975,"logger":"controller_dataset","msg":"Reconciling Dataset","Request.Namespace":"default","Request.Name":"test"}

But nothing happens. 😞
I can describe the dataset resource but the PVC is not created.

If someone can help with this, I'm ready to help and contribute, but I'm blocked on it. 🙂

Existing files owned by root are inaccessible to non-root users

Storage buckets work nicely if they are empty. Existing files and directories are owned by root, so they are inaccessible to non-root users in a container. I have tried object stores on GCP, AWS, and a custom S3-compatible object store. Note that this applies to files and directories created outside of DLF via S3 APIs; the ones created via DLF have the correct ownership if the bucket is mounted a second time.

DaemonSets csi-s3 and csi-nodeplugin-nfsplugin are unable to create pods due to missing RBAC configuration for ServiceAccounts

The manifest file does not configure the SecurityContextConstraints for the following ServiceAccounts:

  • csi-provisioner
  • csi-s3
  • csi-nodeplugin
  • csi-attacher

As a result, on OpenShift, containers which are expecting to run in privileged mode are unable to get access to features such as hostNetwork, and hostPath. In turn, the DaemonSets csi-s3 and csi-nodeplugin-nfsplugin are unable to spawn pods on the cluster nodes because the ServiceAccounts csi-s3 and csi-nodeplugin are not registered as users in the privileged SecurityContextConstraints. Similar issues manifest due to the ServiceAccounts csi-attacher and csi-provisioner not being registered as users in the privileged SecurityContextConstraints.

Done when

The instructions in the README.md file address the RBAC configuration of the service accounts. This can either be done via

oc adm policy add-scc-to-user privileged -n dlf -z csi-provisioner -z csi-s3 -z csi-nodeplugin -z csi-attacher

Alternatively, the instructions could use the JSON patch feature of the kubectl utility, like so:

kubectl patch scc privileged --type=json -p '[
  {"op": "add", "path": "/users/-", "value": "system:serviceaccount:dlf:csi-provisioner"}, 
  {"op": "add", "path": "/users/-", "value": "system:serviceaccount:dlf:csi-nodeplugin"}, 
  {"op": "add", "path": "/users/-", "value": "system:serviceaccount:dlf:csi-attacher"}, 
  {"op": "add", "path": "/users/-", "value": "system:serviceaccount:dlf:csi-s3"}]'

S3 config format

I am trying to set this up with an S3 bucket. I have provided my keys. I get:

failed to provision volume with StorageClass "csi-s3": rpc error: code = Unknown desc = failed to initialize S3 client: Endpoint: does not follow ip address or domain name standards.

I have:

endpoint=s3.eu-west-2.amazonaws.com

which is clearly wrong, but what would be RIGHT?
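Judging from the other Dataset examples on this page, the endpoint is given as a full URL including the scheme rather than a bare hostname; a sketch (bucket and keys are placeholders):

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: my-dataset
spec:
  local:
    type: "COS"
    accessKeyID: "XXX"
    secretAccessKey: "XXX"
    endpoint: "https://s3.eu-west-2.amazonaws.com"   # URL with scheme, not a bare hostname
    bucket: "my-bucket"
    region: "eu-west-2"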

Addressing by name or ID

The README states that addressing of datasets is done "using the unique ID defined at creation time". When I look at the example, it looks like addressing is done by the name of the Dataset CR. Can you maybe clarify that in the README?
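For what it's worth, the pod examples elsewhere on this page reference the Dataset's metadata.name in the label, e.g.:

metadata:
  labels:
    dataset.0.id: "your-dataset"     # matches the Dataset's metadata.name
    dataset.0.useas: "mount"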

Specify access mode in NFS Dataset

As asked in the comments in #67, I'm opening a new issue to track specifying the access mode in an NFS Dataset.

Goal: be able to specify whether the RWO or RWX access mode should be used.
A way to do that is lightly described in #67.
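Purely as a hypothetical sketch of the requested feature (the accessMode field does not exist yet, and the NFS field names and values here are assumed):

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: nfs-dataset
spec:
  local:
    type: "NFS"
    server: "nfs.example.com"        # assumed field names/values
    share: "/nfs/export"
    accessMode: "ReadWriteOnce"      # proposed field: RWO instead of the default RWX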

Support specifying which cache plugin (or None) to use for a particular dataset

I noticed that at the moment DLF queries for the installed caching plugins and always uses the first result (if any) to cache the dataset. However, when multiple caching plugins are installed, it would be handy to be able to specify which one should cache the dataset. Also, there are cases where the user may want to opt out of caching altogether.

As a solution to the above points, I am thinking that a new label on the dataset, with the key cache.plugin and the name of the caching plugin as the value, could be used to identify which of the installed plugins to use. When the value is None, the user could easily opt out of caching the dataset.
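A sketch of what this could look like on a dataset, with cache.plugin being the proposed new label key (credentials and bucket are placeholders):

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: example-dataset
  labels:
    cache.plugin: "None"             # proposed: opt out of caching, or name an installed caching plugin
spec:
  local:
    type: "COS"
    accessKeyID: "XXX"
    secretAccessKey: "XXX"
    endpoint: "https://s3.amazonaws.com"
    bucket: "bucket-name"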

Any thoughts?

Thanks

Limit access to datasets to specific pods running specific images

Currently the user can create a dataset like this:

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: your-dataset
spec:
  local:
    type: "COS"
    accessKeyID: "{AWS_ACCESS_KEY_ID}"
    secretAccessKey: "{AWS_SECRET_ACCESS_KEY}"
    endpoint: "{S3_SERVICE_URL}"
    bucket: "{BUCKET_NAME}"
    region: "" #it can be empty

Then if they specify a pod like this:

apiVersion: v1
kind: Pod
metadata:
  name: simple-nginx
  labels:
    dataset.0.id: "your-dataset"
    dataset.0.useas: "configmap"
spec:
  containers:
    - name: nginx
      image: nginx

It will be mutated as follows:

    - configMapRef:
        name: your-dataset
      prefix: your-dataset_
    - prefix: your-dataset_
      secretRef:
        name: your-dataset

As a result, the credentials are available in the pod as environment variables with the your-dataset_ prefix.

However, there are scenarios where we only want authorized images to access the credentials and not any pod.

We are designing with @mrsabath how this could be achieved with https://github.com/IBM/trusted-service-identity and this issue will capture this process.

From the DLF perspective, we need to upload the secrets to Vault once a Dataset is created. The key-value paths would look like this:

<cluster>/<namespace>/<dataset>/accessKeyID
<cluster>/<namespace>/<dataset>/secretAccessKey
....

Then we need to modify our admission controller to add the necessary labels to the user's pod, which would allow TSI to check whether this pod can use these credentials or not. Ideally it should work as before and expose them as env variables:

<dataset>_accessKeyID = xxxxx
<dataset>_secretAccessKey = xxxx

If the image is not authorized, the credentials should not be injected.

DLF labels conflict with Istio sidecar injection

When trying to deploy a pod with DLF labels inside a namespace with Istio injection enabled, I'm seeing the errors below. It looks like there's a conflict between the DLF and Istio mutations.

The Pod "nginx" is invalid: spec.volumes[4].name: Duplicate value: "example-dataset"

Here is my pod

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: default
  labels:
    dataset.0.id: "example-dataset"
    dataset.0.useas: "mount"
spec:
  containers:
    - name: nginx
      image: nginx
EOF

Problem Mounting Existing Bucket

Hi there,

Thanks for the efforts; I'm finding this project very useful. One issue I'm having, however (apologies if it's obvious), is mounting an existing bucket, even though I am specifying the bucket in the secret, e.g. this is for a non-AWS S3 endpoint:

apiVersion: v1
data:
  accessKeyID: accessKey
  bucket: bucket
  endpoint: endpoint
  region: ""
  secretAccessKey: secretAccessKey
kind: Secret
metadata:
  name: csi-s3-pvc
  namespace: test-namespace
type: Opaque

Rather than mounting the specified bucket, it instead generates a new bucket with the name of the Kubernetes PVC. I just want to confirm whether I am doing things correctly and, if not, what I need to change.

Versions:
Attacher: 2.2.0
Provisioner: 1.6.0
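In case it helps, the Dataset examples elsewhere on this page point at the pre-existing bucket explicitly, and the PVC then takes the Dataset's name; a hedged sketch for a non-AWS endpoint (all values are placeholders):

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: existing-bucket-dataset
  namespace: test-namespace
spec:
  local:
    type: "COS"
    accessKeyID: "accessKey"
    secretAccessKey: "secretAccessKey"
    endpoint: "https://s3.example.com"    # placeholder endpoint
    bucket: "bucket"                      # the pre-existing bucket to mount
    region: ""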

Watch multiple namespaces

By default we watch all namespaces; we should pass a list of namespaces to monitor instead.

Creating a dataset from an S3 bucket with 1 GB of data results in a ~9000 GB PVC

Hello and thank you for this really cool project.

I am trying to create a dataset on a k8s cluster hosted on an OpenStack provider, and it seems that every time I create a dataset I get a PVC and PV that are very large (9314Gi), even though the S3 bucket I am using only has dummy data totalling less than 1 GB.

NAME                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
example-dataset-gcs   Bound    pvc-85d52fa7-ab1e-4a4a-abb9-2ab687455188   9314Gi     RWX            csi-s3         22m

I thought this was happening because I was using OpenStack's S3-compatible storage. However, the same thing occurred when using GCS (which is also supposed to be S3-compatible). I apologize that I could not try AWS S3; I do not have easy access to an account.

Is there a way to specify how big the PV/PVC can be?

This is my only issue. Everything else seems to work.

I followed the templates here for creating the dataset:
https://github.com/IBM/dataset-lifecycle-framework/blob/master/examples/templates/example-dataset-s3-secrets.yaml
https://github.com/IBM/dataset-lifecycle-framework/blob/master/examples/templates/example-s3-secret.yaml

I installed using this command:

kubectl apply -f https://raw.githubusercontent.com/IBM/dataset-lifecycle-framework/master/release-tools/manifests/dlf.yaml

For this I am using Kubernetes 1.19.6 on an RKE cluster deployed on an OpenStack provider.

support IBM Cloud IAM API Key instead of HMAC keypair when configuring COS bucket as dataset

The use case is about working with data on IBM COS. I followed the guide here: https://github.com/IBM/dataset-lifecycle-framework/wiki/Data-Volumes-for-Notebook-Servers#create-a-dataset-for-the-s3-bucket

Where the guide creates a Dataset for the COS bucket, it needs:

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: your-dataset
spec:
  local:
    type: "COS"
    accessKeyID: "access_key_id"
    secretAccessKey: "secret_access_key"
    endpoint: "https://YOUR_ENDPOINT"
    bucket: "YOUR_BUCKET"
    region: "" #it can be empty

Which requires a service credential to be created.

I wonder if it could support creating the dataset via:

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: your-dataset
spec:
  local:
    type: "COS"
    ibm_cloud_iam_apikey: "<base64 encoded api key>"
    bucket: "YOUR_BUCKET"
    region: "" #it can be empty

This would make a COS admin's life much easier, since it delegates secret management and rotation to IBM Cloud IAM.

Removing bogus kubelet error message on IKS if possible

On IKS, mounting with the S3 CSI driver still shows the errors below. Although this does not block the pod from mounting, it takes the kubelet a few minutes to realize the message is bogus, which creates a bottleneck in mounting time.

Events:
  Type     Reason       Age   From               Message
  ----     ------       ----  ----               -------
  Normal   Scheduled    5m6s  default-scheduler  Successfully assigned default/nginx to 10.168.14.70
  Warning  FailedMount  5m    kubelet            MountVolume.SetUp failed for volume "pvc-ae703fc0-26d4-4ba2-bb92-2d709985e72b" : kubernetes.io/csi: mounter.SetupAt failed: rpc error: code = Unknown desc = Error fuseMount command: goofys
args: [--endpoint=http://minio-service.kubeflow:9000 --profile=pvc-ae703fc0-26d4-4ba2-bb92-2d709985e72b --type-cache-ttl 1s -f --stat-cache-ttl 1s --dir-mode 0777 --file-mode 0777 --http-timeout 5m -o allow_other -o ro e6dbfd34-1ed9-11eb-8b10-d62589704c0d /var/data/kubelet/pods/f3889d7b-0ee6-40d3-8add-87ddb82a1901/volumes/kubernetes.io~csi/pvc-ae703fc0-26d4-4ba2-bb92-2d709985e72b/mount]
output: 2020/11/04 20:11:37.734151 s3.ERROR code=NoCredentialProviders msg=no valid providers in chain. Deprecated.
  For verbose messaging see aws.Config.CredentialsChainVerboseErrors, err=<nil>

2020/11/04 20:11:37.734280 main.ERROR Unable to access 'e6dbfd34-1ed9-11eb-8b10-d62589704c0d': NoCredentialProviders: no valid providers in chain. Deprecated.
  For verbose messaging see aws.Config.CredentialsChainVerboseErrors
2020/11/04 20:11:37.734297 main.FATAL Mounting file system: Mount: initialization failed
  Warning  FailedMount  3m3s   kubelet  Unable to attach or mount volumes: unmounted volumes=[example-dataset], unattached volumes=[default-token-wspcg example-dataset]: timed out waiting for the condition
  Warning  FailedMount  2m59s  kubelet  MountVolume.SetUp failed for volume "pvc-ae703fc0-26d4-4ba2-bb92-2d709985e72b" : kubernetes.io/csi: mounter.SetupAt failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Normal   Pulling      2m50s  kubelet  Pulling image "nginx"
  Normal   Pulled       2m49s  kubelet  Successfully pulled image "nginx"
  Normal   Created      2m49s  kubelet  Created container nginx
  Normal   Started      2m49s  kubelet  Started container nginx

Dataset labels not working for deployments

When I use a dataset label to mount a dataset to a deployment, nothing happens.

Upon further inspection, it looks like the MutatingWebhookConfiguration is only triggered for pods, not for deployments. For deployments, the mutate function should make exactly the same changes it does for pods, but operate on the /spec/template/spec path instead of /spec.
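A rough sketch of the rule change this would need in the MutatingWebhookConfiguration, assuming the existing webhook only lists pods (the names, service, and path below are illustrative, not the actual ones shipped by DLF):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: dlf-mutating-webhook                 # illustrative name
webhooks:
  - name: mutate.dataset.example.com         # illustrative name
    rules:
      - apiGroups: ["", "apps"]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods", "deployments"]   # add deployments alongside pods
    clientConfig:
      service:
        name: dataset-operator-webhook       # illustrative service
        namespace: dlf
        path: /mutate
    admissionReviewVersions: ["v1"]
    sideEffects: None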

Multiple namespaces installation problems

If you do the full installation in a different namespace, the previous installation breaks.
We need two things fixed:

  • In the installation, check whether the dataset operator is already running in another namespace and prevent re-installation
  • Explain on the wiki how to extend an existing installation to multiple namespaces

FYI @davidyuyuan

Multi user support in NFS shares

Hi,
another question 😅
Have you thought about how to prevent users from mounting other users' PVCs in NFS?

We have one export share. When a user creates a Dataset, they need to specify the path.
Let's say the path is /nfs/export and the option createDirPVC: "true" is set. In this case, the user gets their own share at /nfs/export/myshare. However, nothing stops a user from mounting the whole export simply by specifying the path as /nfs/export and setting createDirPVC: "false".

I think this is a serious issue in multi-tenancy environments, and it makes the feature unusable because of the security problem. Maybe if the Helm chart were up and ready for use, the path could be configurable somewhere in the values; the user then wouldn't have to specify whether to create a directory, and the default would be to create a directory named after the path plus the Dataset name and mount only the resulting path in the PVC.

Throw errors for names containing illegal characters

As reported in #106, in dataset definitions like this:

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: kind-example-v0.2-try6-cp4d3f6d318f7c
spec:
  local:
    type: "COS"
    secret-name: "bucket-creds"
    secret-namespace: "m4d-system"
    endpoint: "http://s3.eu.cloud-object-storage.appdomain.cloud"
    provision: "true"
    bucket: "kind-example-v0.2-try6-cp4d3f6d318f7c"

there should be an appropriate error message in dataset.status.
FYI @shlomitk1
