exasol / ai-lab

Development environment for data science developers

License: MIT License

Python 54.04% Jupyter Notebook 45.35% Jinja 0.06% Shell 0.53% Dockerfile 0.02%
exasol exasol-integration language-container script-languages

ai-lab's People

Contributors

ahsimb, ckunki, dejanmihajlovic, marlenekress79789, shmuma, tkilias, tomuben


ai-lab's Issues

Create CI build workflow to build and push Docker Image

Acceptance criteria

  • Tests have been renamed / moved:
    • git mv test/aws test/integration/aws
    • git mv test/ci test/codebuild (requires updating script aws-code-build/ci/buildspec.yaml, too)
  • CI Build has been enhanced to additionally execute all tests in folder test/integration
  • Developer Guide has been enhanced to describe CodeBuild triggering in more detail
  • Generated release letter is enhanced to also name the Docker image

Details on CodeBuild triggering

On release

  • The user calls release-droid,
  • which in turn runs GitHub action release_droid_upload_github_release_assets.yml.
  • This calls exasol.ds.sandbox.main start-release-build,
  • which executes lib/release_build/run_release_build.py.

Currently, the webhook is configured in GitHub so that GitHub sends a POST request on specific events, e.g. push.
Apparently the event contains the name of the current branch, e.g. refs/heads/main, and the commit message.
The CodeBuild project in the AWS stack is configured to evaluate the event and to start on commit message "[CodeBuild]".
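
For illustration, a minimal Python sketch of which payload fields such a trigger evaluates (branch ref and commit message); the actual filtering happens in the CodeBuild webhook configuration, not in DSS code:

import json

# Sketch only: decide from a GitHub push-event payload whether the build
# should start; field names follow the GitHub push-event schema.
def should_trigger_codebuild(event_body: str) -> bool:
    event = json.loads(event_body)
    ref = event.get("ref", "")                                  # e.g. "refs/heads/main"
    message = (event.get("head_commit") or {}).get("message", "")
    return ref.startswith("refs/heads/") and "[CodeBuild]" in message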

Open questions:

  • The web hooks in the GitHub repo are created by AWS template templates/ci_code_build.jinja.yaml
  • The ci user used by AWS CodeBuild has therefore been granted access to the GitHub repo, including the permission to create web hooks

Details on release letter

Create a table-of-contents notebook

This should be the header notebook of the set of tutorials. It should

  • Include links to all other notebooks.
  • Provide the structure of the tutorials.

DSS Rename classes and packages in lib/ansible

Replace from exasol.ds.sandbox.lib.ansible.ansible_run_context import AnsibleRunContext
by import exasol.ds.sandbox.lib.ansible and call ansible.RunContext.

This could be done by creating a file ansible/__init__.py with content import ...ansible.repository,
which enables the usage ansible.repository.default.
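
A possible sketch of such an __init__.py (the exact set of re-exported names is an assumption; only AnsibleRunContext and AnsibleAccess are mentioned in this ticket):

# exasol/ds/sandbox/lib/ansible/__init__.py (sketch)
from exasol.ds.sandbox.lib.ansible.ansible_run_context import AnsibleRunContext as RunContext
from exasol.ds.sandbox.lib.ansible.ansible_access import AnsibleAccess as Access

# Usage at call sites:
# from exasol.ds.sandbox.lib import ansible
# ctx = ansible.RunContext(...)
# access = ansible.Access(...)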

@Nicoretti :

Having a module ansible_access which just contains a class AnsibleAccess does not make much sense and only increases noise and verbosity imho. (ansible_access.AnsibleAccess vs ansible.Access)

ansible.Access would be just as descriptive.

Fix setup_db.ipynb

  • make it reusable from other notebooks
  • make parameters redefinable

Make systemd call entrypoint.py

Ticket #69 described adding an entry point to the DSS Docker Container Images in order to copy notebook files and start the Jupyter server.

The entrypoint was implemented as Python script entrypoint.py.
The current ticket requests to call entrypoint.py for the other image types (AMI, vmdk, vhd), too.

Please note: This is not required for calling magic %pip, see #273

Rename all occurrences of script-languages-developer-sandbox to data-science-sandbox

Tasks

Remove unnecessary apt-update

Currently, there are multiple Ansible tasks calling apt update (update_cache: true or update_cache: yes).
This takes some time and is only required once at the beginning.

The current ticket requests to remove unnecessary calls to apt update.

Use a non-root user to run Jupyter in the Docker Image

Security best practices recommend not running services in Docker as root. We should prepare a non-root user that is used to run Jupyter.

https://cheatsheetseries.owasp.org/cheatsheets/Docker_Security_Cheat_Sheet.html#rule-2-set-a-user

In total we will need two users besides root:

  • A user with sudo permission, used for the Ansible installation using sudo where required:
    • named ansible for the Docker Edition
    • named ubuntu for AMI and VM images
  • A user jupyter without sudo permission, used for running the Docker container and JupyterLab.

Investigate persistent storage of user modified files, e.g. Jupyter notebooks

Use case / requirements

  • DSS should ship some files with a-priori default content
  • Users of DSS should be able to modify these files by using the DSS
  • DSS should enable users to save the modified files persistently
  • When starting the DSS the next time, users should be able to continue with the last saved state of their files

The current ticket requests to investigate how these requirements could be achieved.

Idea: Maybe Docker VOLUME can be used to achieve this functionality?
See https://docs.docker.com/engine/reference/builder/#volume

CLI: setup-vm-bucket-waf: default value for cli option allowed-ip

DSS stores AMI images in a publicly accessible AWS S3 bucket. In case of automated downloads, Exasol faces the risk of significant costs, as AMI images are quite large (several GB).

To avoid this risk the S3 bucket is protected by a web application firewall (WAF) which asks the downloader to solve a captcha. This should prevent at least simple implementations of automated downloaders.

According to AWS documentation, WAF with captcha is only supported for region us-east-1 (N. Virginia), which should already be covered by the default setting for waf_region in file exasol/ds/sandbox/lib/config.py.

Additionally, DSS does not require any access to the S3 bucket without WAF, but the AWS CLI requires specifying at least one IP address to be excluded.

poetry install # if you modified the template
export AWS_DEFAULT_REGION=us-east-1
poetry run exasol/ds/sandbox/main.py setup-vm-bucket-waf --allowed-ip 127.0.0.1
poetry run exasol/ds/sandbox/main.py setup-vm-bucket --aws-profile ci4_mfa

The current ticket therefore asks to

  1. Enhance the DSS developer guide to describe the region restriction and how to handle it
  2. Add a default value to CLI option --allowed-ip of DSS command setup-vm-bucket-waf

Current implementation:

@click.option('--allowed-ip', type=str)

Proposal:

@click.option('--allowed-ip', type=str, default="127.0.0.1", show_default=True)

See https://click.palletsprojects.com/en/8.1.x/options/

Move SLC cloning to notebook

In the past SLC was cloned by Ansible task exasol/ds/sandbox/runtime/ansible/roles/script_languages/tasks/main.yml

In the future this should be done by the related SLC notebook in the DSS.

This will require changes to

  • user_guide.md
  • test/ansible_conflict/slc_setup.yml
  • test/test_ansible.py
  • test/test_ci.py
  • test/test_install_dependencies.py
  • test/test_release_build.py
  • exasol/ds/sandbox/lib/config.py
  • exasol/ds/sandbox/lib/setup_ec2/run_install_dependencies.py

See also

Investigate reuse strategies for Docker image

Potential use cases

UC-1

  • notebook developer Nadine works on creating a new Jupyter notebook or updating an existing one
  • Nadine wants to use new libraries that are not available yet in the latest release on docker-hub
  • Nadine therefore wants to build a private Docker image from the branch she is currently working on

UC-2

  • As image creation currently (Nov 2023) takes around 7 minutes, Nadine wants to reuse the image in follow-up usage

UC-3

  • Nadine changed a file or dependency that requires re-creating the Docker image, taking the change into account

Investigate how to store the notebooks persistently in case of Docker container removal

The current implementation is

  1. Create and deliver the Docker Container with default notebook files in folder $BACKUP
  2. Mount a host directory to mount point $MOUNT inside the Docker Container
  3. Add an entry point to the Docker Container
  4. For each notebook file the entry point checks if the file is present in $MOUNT
  5. If not then the entry point copies the file from $BACKUP to $MOUNT
    • Entry point only copies missing files and directories (FS objects)
    • Copied FS objects are owned by root (of Docker Container)
    • The entry point relaxes permissions for copied FS objects so that the host user outside the Docker Container can still modify and delete these FS objects in the mounted directory (a sketch of this copy logic follows after the list)
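
A minimal sketch of this copy logic (paths and function name are illustrative, not the actual entrypoint.py):

import os
import shutil
from pathlib import Path

def copy_missing_defaults(backup: Path, mount: Path) -> None:
    """Copy FS objects missing in $MOUNT from $BACKUP and relax permissions."""
    for src in backup.rglob("*"):
        dst = mount / src.relative_to(backup)
        if dst.exists():
            continue  # never overwrite files the user has already modified
        if src.is_dir():
            dst.mkdir(parents=True, exist_ok=True)
        else:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
        # allow the host user outside the container to modify and delete the copy
        os.chmod(dst, 0o777)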

Directory setting in current implementation

Type of DSS image:

Folder     Docker                               AMI, VM images (vhd, vmdk)
$BACKUP    {{ user_home }}/notebook_defaults    {{ user_home }}/notebooks
$MOUNT     {{ user_home }}/notebooks            mounting is not supported

Additional task:

  • Try out VOLUME directive in Dockerfile, results: see comment below.

The current implementation uses a Python script as entry point because

  • it is more flexible,
  • it is easier to test, and
  • it offers higher-level structures and language features.

However, analysis showed that handling owners and permissions for FS objects in the mounted directories is difficult.
We therefore propose not to mount a host directory directly into the Docker container but to use Docker volumes for persistent storage. Ticket #78 therefore requests to add instructions to the DSS User Guide.
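
For the user guide, usage could look roughly like the following sketch based on the Docker SDK for Python (volume name and mount path are assumptions; the image tag is only an example):

import docker

client = docker.from_env()
# A named Docker volume keeps the notebooks even if the container is removed.
container = client.containers.run(
    "exasol/data-science-sandbox:0.1.0-dev-1",
    detach=True,
    volumes={"dss-notebooks": {"bind": "/home/jupyter/notebooks", "mode": "rw"}},
)
print(container.id)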

Remove apt cache to reduce image size

Background

We need to update the apt cache during the setup of the EC2 instance. Maybe we could decrease the size of the final VM files by cleaning up the apt cache at the end.

Acceptance Criteria

Run the following Ansible tasks at the end of the setup:

- ansible.builtin.apt:
    clean: yes
  become: yes

- ansible.builtin.file:
    path: /var/lib/apt/lists/
    state: absent

See also https://askubuntu.com/questions/1050800/how-do-i-remove-the-apt-package-index

Here is a measurement on our image:

$ docker run -it exasol/data-science-sandbox:0.1.0-dev-1 bash
root@9fed5334f0a2:/# du -sh  /var/lib/apt/lists/
46M	/var/lib/apt/lists/

Enable suppressing Ansible output

Currently DSS calls the external library function ansible_runner.run() in ansible_access.py.

Some output is already passed to the argument printer defined in ansible_runner.py.

However, Ansible prints some more output that can only be suppressed by adding the argument quiet=True to ansible_runner.run(); see the documentation of ansible_runner.

The current ticket requests to

  1. Add an argument to AnsibleAccess.run() enabling callers to control the value of the argument quiet passed to ansible_runner.run() (see the sketch below).
  2. Update callers like ansible_runner.py to pass argument quiet depending on the current log_level, e.g. quiet = (log_level >= logging.INFO).
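
A sketch of the requested change (the signature of AnsibleAccess.run() is an assumption; quiet is a documented parameter of ansible_runner.run()):

import logging
import ansible_runner

class AnsibleAccess:
    def run(self, private_data_dir: str, quiet: bool = False, **kwargs):
        # quiet=True suppresses the remaining console output of Ansible
        return ansible_runner.run(private_data_dir=private_data_dir, quiet=quiet, **kwargs)

# A caller could derive quiet from the current log level:
quiet = logging.getLogger().getEffectiveLevel() >= logging.INFO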

CLI: Refactor location of scripts and naming of CLI commands

Renaming, see Developer guide:

  • setup-ec2 → create-ec2-instance
  • create-vm → create-vm-image
    • TODO check if it is multiple images of different types or only AMI

Location

  • Currently all CLI commands are in one folder exasol/ds/sandbox/cli/commands
  • The current ticket requests to organize the CLI scripts into folders for Docker, EC2, AWS, ...

Additional tasks

  • Register scripts in poetry to enable execution without running python ...
  • This could be done together with #21

Move Jupyter notebooks to folder visible to ansible

Ticket #16 requests to Install Jupyter notebooks via ansible.
As a prerequisite the notebook files need to be made visible to ansible.

The current ticket therefore requests to move the notebook files to a folder visible to Ansible.

To avoid merge conflicts, this should only be done after all current pull requests involving changes to one of the notebook files have been merged, e.g. PR #32.

Decouple DSS release versions from SLC version

Currently release versions of DSS are based on the version of SLC (https://github.com/exasol/script-languages-release/).

The current ticket requests to decouple the release versions of DSS. In effect, image names (e.g. of AMIs) no longer need to contain the SLC version but only the version of the DSS.

Required changes

  • documentation: user-guide, developer-guide
  • build scripts / implementation
  • exasol/ds/sandbox/lib/release_build/run_release_build.py
  • test/test_ci.py
  • exasol/ds/sandbox/cli/options/id_options.py

Enhance setup_ec2 to return AnsibleFacts

  • Currently AnsibleAccess looks for entry "docker_container" in extra_vars
  • setup_ec2/run_setup_ec2_and_install_dependencies.py passes HostInfo to run_install_dependencies()
  • run_install_dependencies adds this to
  • AnsibleRunner already supports parameter host_infos but does not forward it
    to AnsibleAccess

Options

  • Either AnsibleAccess.run() should return the ansible_runner result r (which I think breaks the abstraction)
  • or AnsibleRunner should accept host_infos as an additional argument

If using host_infos, then dss_docker/create_image.py should also be updated to forward this info to ansible_runner.

AnsibleAccess should return a dict with

  • hostnames as keys and the
  • corresponding fact_cache as values

The caller can extract the respective fact_cache based on the known hostname (see the sketch below).
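
A sketch of the proposed return value, using ansible_runner's documented get_fact_cache() accessor (function name and parameters are illustrative):

def facts_by_host(runner, hostnames) -> dict:
    # Map each known hostname to the facts cached for it by ansible_runner.
    return {host: runner.get_fact_cache(host) for host in hostnames}

# facts = facts_by_host(r, ["docker_container"])
# docker_facts = facts["docker_container"]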

Docker container: Display usage instructions

Users of the DSS Docker Image should get clear instructions, including

  • how to connect to the Jupyter server: hostname and port?
  • the default password
  • a recommendation to change the default password

Inside the Docker container this can be done with a logger or a Python print.
If the user did not specify CLI option -d or --detach, then the output will be displayed on the console.
Otherwise the user needs to call docker logs.

As the Docker container will only know its own IP, our best option could be to write generic instructions and, as an example, use localhost and the default port of Jupyter.

A reminder to change the password is important in any case.
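
A hypothetical sketch of such instructions printed by the entry point (host name, port, and the password placeholder are examples only):

def print_usage(port: int = 8888, default_password: str = "<default password>") -> None:
    print(
        "Jupyter server started.\n"
        f"Open http://localhost:{port} in your browser "
        "(replace 'localhost' by the host the container runs on).\n"
        f"The default password is '{default_password}'.\n"
        "Please change the default password after the first login."
    )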

Create architecture design documents

Background

We need to get an overview of the design and architecture of the Data Science Sandbox, including:

  • User view
  • Image Building
  • Dependencies
  • Notebook Structures
  • SLC Activation
  • Exasol Wrappers

Update strings in images

After the renamings in #5, the images should be updated, too.

Folders:

  • doc/developer_guide/img/
  • doc/user_guide/img/

DSS Add entry point to start Jupyter server

Investigation

Backup path is defined in ansible/general_setup_tasks.yml as "{{user_home}}/notebook-defaults".

Add entry point

Displaying how users can connect to the Jupyter server has been moved to a separate ticket.

Recently I also adjusted the folder to which Ansible copies the notebook files initially.
See issue #51

Install Jupyter notebooks via ansible

After the notebook files have been moved to a different folder as requested by ticket #53, the current ticket requests to install the Jupyter notebooks via Ansible.

Tasks

  • Move files in exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/ to sub-folder slc:
    • slc_main_build_steps.svg
    • bash_runner.py
    • script-languages.ipynb

DSS Push Docker Image to a Docker Registry

Please see ITDE cli/options/push_options.py for pushing images to a registry like hub.docker.com

ITDE test utility docker_registry.py

  • Uses docker container registry
  • Allocates a free port
  • Starts the container as local registry for verifying the publication of docker images
  • In the end calls docker_image_push_base_task

Option force_push

  • Seems to be evaluated only in this method.
  • I don't see a check for whether an image exists, only whether the image has been built (recently?)

For integration tests please have a look at file test_api_push_test_container.py of the ITDE and note:

  • api.push_test_container accepts arguments username and password for source and target docker registries.
  • The test does not provide these arguments.
  • api.push_test_container will set the value None in its calls to set_docker_repository_config()
  • DockerPushImageBaseTask will use None for fields username and password in auth_config.

Password

  • Support providing a user name, and make Python ask for the password interactively (see the sketch below)
  • Support an environment variable to pass the secret in CI builds
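
A sketch of the requested password handling (the name of the environment variable is an assumption):

import getpass
import os

def resolve_registry_password() -> str:
    # CI builds pass the secret via an environment variable ...
    password = os.environ.get("DOCKER_REGISTRY_PASSWORD")
    if password is None:
        # ... otherwise ask the user interactively without echoing the input
        password = getpass.getpass("Docker registry password: ")
    return password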

Create basic notebooks with cloud-storage-extension

  • fetch the latest jar from the cloud-storage-extension GitHub releases (see the sketch after this list)
  • put the cloud-storage-extension jar into BucketFS
  • create import scripts
  • arrange sample data (Parquet) on S3 (a public bucket is needed for this)
  • import it into a table
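
A hedged sketch for fetching the latest jar via the GitHub releases API (the asset selection logic is an assumption):

import requests

def latest_jar_url() -> str:
    url = "https://api.github.com/repos/exasol/cloud-storage-extension/releases/latest"
    release = requests.get(url, timeout=30).json()
    jars = [asset["browser_download_url"]
            for asset in release["assets"] if asset["name"].endswith(".jar")]
    return jars[0]  # assumes the release contains exactly one relevant jar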

Change initial password of Jupyter notebook

Currently: "script-languages"
Expected: "dss"

Tasks

Content of ansible task:

 password: "{{ lookup('ansible.builtin.env', 'JUPYTER_LAB_PASSWORD', default='script-languages') }}"

Change default port of Jupyter server

Currently 8888, which is quite frequently used for various purposes.

Proposal: use a different port, probably > 20000.

Wikipedia lists some "ephemeral" ports, still containing potential conflicts.
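
For illustration, the port could be set in the Jupyter server configuration file (a sketch; the concrete value and whether DSS uses this mechanism are assumptions):

# jupyter_server_config.py (the variable c is provided by Jupyter's config loader)
c.ServerApp.port = 49152   # example value > 20000, following the proposal above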
